A short description about the course and a link to my GitHub repository.
# "R chunk" for R code.
date()
## [1] "Thu Dec 09 16:07:00 2021"
How are you feeling right now? What do you expect to learn? Where did you hear about the course?
Hello!
My name is Janina and I am currently working at INAR (Institute for Atmospheric and Earth System Research) in Kumpula. I have been studying meteorology and I finished my master's studies a couple of days ago! Now I am applying for doctoral studies on the topic of climate education. As I am moving towards education and educational research, I need R and statistical skills. One of my colleagues from the Faculty of Educational Sciences suggested this course.
I am feeling quite excited. At first I was stressed about getting back to studying, but I bet everything will go well.
Have a lovely day,
Janina
Describe the work you have done this week and summarize your learning.
Let us take a closer look at the data, which resulted from the Data Wrangling exercise: it was collected in 2014/15 in Finland through a survey about students' learning during a statistics course.
date()
## [1] "Thu Dec 09 16:07:00 2021"
#Reading the data
learning2014 = read.table(file="learning2014.txt", sep=",", header=TRUE)
#Structure and dimensions of the data
str(learning2014)
## 'data.frame': 166 obs. of 7 variables:
## $ gender : chr "F" "M" "F" "M" ...
## $ age : int 53 55 49 53 49 38 50 37 37 42 ...
## $ attitude: num 3.7 3.1 2.5 3.5 3.7 3.8 3.5 2.9 3.8 2.1 ...
## $ deep : num 3.58 2.92 3.5 3.5 3.67 ...
## $ stra : num 3.38 2.75 3.62 3.12 3.62 ...
## $ surf : num 2.58 3.17 2.25 2.25 2.83 ...
## $ points : int 25 12 24 10 22 21 21 31 24 26 ...
dim(learning2014)
## [1] 166 7
The structure of the data looks good. There are 166 observations of 7 variables. The variables are gender, age, attitude, deep (learning), stra(tegic approach), surf(ace approach) and points.
‘Deep’, ‘stra’ and ‘surf’ together combine 8 subscales, each including 4 questions. ‘Attitude’ includes 10 questions.
The values have been scaled by taking the mean, and all observations with points = 0 have been excluded.
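The scaling and filtering described above can be sketched roughly as follows (a minimal sketch on toy data; the column names q1 and q2 are placeholders, not the actual question codes used in the wrangling script):

```r
library(dplyr)

# Toy data standing in for the raw survey (q1, q2 are placeholder names)
lrn14 <- data.frame(q1 = c(4, 2, 5), q2 = c(3, 1, 4), points = c(25, 0, 18))

# A combination variable is scaled by taking the row-wise mean of its questions
lrn14$deep <- rowMeans(lrn14[, c("q1", "q2")])

# Observations with zero exam points are excluded
learning_clean <- filter(lrn14, points > 0)
```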
library(GGally)
## Loading required package: ggplot2
## Registered S3 method overwritten by 'GGally':
## method from
## +.gg ggplot2
library(ggplot2)
# Creating a plot matrix with ggpairs(), colouring has been set by gender (M=blue, F=red)
p <- ggpairs(learning2014, mapping = aes(col=gender, alpha=0.3), lower = list(combo = wrap("facethist", bins = 20)))
# Drawing the plot
p
So here we have a large plot matrix of all 7 variables. Red indicates female respondents and blue male respondents. It is worth noticing that there are nearly twice as many female respondents as male, and that the majority of the students are 20 to 30 years old.
In the leftmost column we have the distributions of the observations presented horizontally. The same distributions are also presented in a more visual way along the diagonal. The distributions of age and points are relatively similar between female and male respondents; the largest differences can be found in the distributions of attitude (female attitudes spread more evenly over answers 1-4, males having a higher peak at attitude 3-4) and surface approach (females having a higher peak around 3). This could indicate that the male respondents had a better attitude and motivation than the female respondents.
Below this diagonal, we have scatter plots between pairs of variables, and above the diagonal we have the corresponding correlations numerically. The correlation coefficient, Corr, refers to Pearson correlation and varies between -1 and +1. The closer its absolute value is to 1, the stronger the dependence between the two variables. When the correlation is negative, the dependence is inverse: the value of one variable (y) decreases when the value of the other variable (x) increases.
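The pairwise Pearson correlations shown in the matrix can also be computed directly with base R's cor() (Pearson is the default method):

```r
# Pairwise Pearson correlation between attitude and exam points
cor(learning2014$attitude, learning2014$points)  # ~0.437, as in the plot matrix
# The full correlation matrix of the numeric variables at once
cor(learning2014[, c("age", "attitude", "deep", "stra", "surf", "points")])
```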
The highest correlation is between exam points and attitude (0.437). The next strongest correlation is negative, between surface approach and deep learning (-0.324). As the subscales of the surface approach are mostly negative (lack of purpose, unrelated memorizing, syllabus-boundness), it makes sense that its correlation with deep learning is negative. The weakest correlation (-0.0101) is between deep learning and exam points. It is interesting that these two do not correlate. Could this mean that the deep learning approach (seeking meaning, relating ideas, use of evidence) does not lead to learning that can be measured with exam points? Or is the exam measuring something other than deep learning?
Linear regression and correlation
Before moving to multiple regression, we will look at three chosen explanatory variables for exam points separately: attitude, deep and surf. Their significance will be considered individually and taken into account when moving to multiple regression.
#y ~ x
#y - Target (dependent) variable: Exam points
#x - Three variables as explanatory variables: x1 -> attitude (highest correlation), x2 -> deep (lowest correlation), x3 -> surf (most negative correlation)
EXAM POINTS ~ ATTITUDE
#Creating a scatter plot y ~ x1 (positive correlation)
qplot(attitude, points, data = learning2014) + geom_smooth(method = "lm")
## `geom_smooth()` using formula 'y ~ x'
Here we have a linear regression between attitude (explanatory variable) and exam points (dependent variable). The fitted regression line fits well and slopes upward, as there is a positive correlation between the variables.
The positive regression coefficient can be read from the ‘Estimate’ column of the model summary below. A regression coefficient measures the change in the mean response associated with a unit change in the corresponding explanatory variable, when the values of all other explanatory variables remain unchanged. The next column, ‘Std. Error’, gives the standard error of the coefficient estimate, i.e. the uncertainty of the estimated coefficient. The residuals, summarized at the top of the output, are the differences between observed and fitted values of the response variable; they estimate the error terms of the model, and the smaller they are, the less error there is.
Significance can be assessed using the p-value (Pr(>|t|)). The lower (closer to zero) the p-value, the less plausible it is that the true coefficient is zero. If the p-value exceeds 0.05, the relation between the explanatory and target variable is conventionally not considered statistically significant, as the risk of the coefficient actually being zero is deemed too high.
Let us look at the p-value in our case. The value is really small (4.12e-09 << 0.05), which indicates that there is a statistically significant relationship between attitude and exam points.
#Creating a regression model with the explanatory variables x1
my_model1 <- lm(points ~ attitude, data = learning2014)
#Printing out a summary of the model
print(summary(my_model1))
##
## Call:
## lm(formula = points ~ attitude, data = learning2014)
##
## Residuals:
## Min 1Q Median 3Q Max
## -16.9763 -3.2119 0.4339 4.1534 10.6645
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 11.6372 1.8303 6.358 1.95e-09 ***
## attitude 3.5255 0.5674 6.214 4.12e-09 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 5.32 on 164 degrees of freedom
## Multiple R-squared: 0.1906, Adjusted R-squared: 0.1856
## F-statistic: 38.61 on 1 and 164 DF, p-value: 4.119e-09
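As a quick worked example of reading the estimates above (a sketch; attitude 3 is an arbitrary illustrative value):

```r
# Predicted exam points for a student with attitude 3, from the estimates above
11.6372 + 3.5255 * 3  # about 22.2 points
# Equivalently, through the fitted model object
predict(my_model1, newdata = data.frame(attitude = 3))
```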
EXAM POINTS ~ DEEP LEARNING
#Creating a scatter plot y ~ x2 (nearly non-existing correlation)
qplot(deep, points, data = learning2014) + geom_smooth(method = "lm")
## `geom_smooth()` using formula 'y ~ x'
Here we have a linear regression between deep learning (explanatory variable) and exam points (dependent variable). The fitted regression line is nearly horizontal, which indicates that there is a weak correlation (if any).
Now, looking at the p-value (Pr(>|t|)) shows that the value is fairly large (0.897 >> 0.05). This indicates that there is no statistically significant relationship between deep learning and exam points.
#Creating a regression model with the explanatory variables x2
my_model2 <- lm(points ~ deep, data = learning2014)
#Printing out a summary of the model
print(summary(my_model2))
##
## Call:
## lm(formula = points ~ deep, data = learning2014)
##
## Residuals:
## Min 1Q Median 3Q Max
## -15.6913 -3.6935 0.2862 4.9957 10.3537
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 23.1141 3.0908 7.478 4.31e-12 ***
## deep -0.1080 0.8306 -0.130 0.897
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 5.913 on 164 degrees of freedom
## Multiple R-squared: 0.000103, Adjusted R-squared: -0.005994
## F-statistic: 0.01689 on 1 and 164 DF, p-value: 0.8967
EXAM POINTS ~ SURFACE APPROACH
#Creating a scatter plot y ~ x3 (negative correlation)
qplot(surf, points, data = learning2014) + geom_smooth(method = "lm")
## `geom_smooth()` using formula 'y ~ x'
Lastly, we have a linear regression between surface approach (explanatory variable) and exam points (dependent variable). The fitted regression line has a negative slope; thus, the correlation between the variables is inverse.
Let us check the p-value (Pr(>|t|)) again. The value is just outside the level of significance (0.0635 > 0.05), which indicates that there is no statistically significant relationship between surface approach and exam points. However, the p-value is so close to the significance level that in some cases it could still be taken into account (with careful consideration).
#Creating a regression model with the explanatory variables x3
my_model3 <- lm(points ~ surf, data = learning2014)
#Printing out a summary of the model
print(summary(my_model3))
##
## Call:
## lm(formula = points ~ surf, data = learning2014)
##
## Residuals:
## Min 1Q Median 3Q Max
## -14.6539 -3.3744 0.3574 4.4734 10.2234
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 27.2017 2.4432 11.134 <2e-16 ***
## surf -1.6091 0.8613 -1.868 0.0635 .
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 5.851 on 164 degrees of freedom
## Multiple R-squared: 0.02084, Adjusted R-squared: 0.01487
## F-statistic: 3.49 on 1 and 164 DF, p-value: 0.06351
To make the next task meaningful, I will keep two of the explanatory variables: attitude and surface approach. The latter is not statistically significant, but I would not be able to do multiple regression if I dropped it, so it stays. As deep learning has no significance whatsoever, it will be excluded from the next phase.
Below we have a summary of a multiple regression including all three variables. Already from the standard errors we can see that the uncertainty is smallest for the attitude coefficient (strongest correlation).
The multiple R-squared is the square of the correlation between the fitted values of the three explanatory variables and the exam points. Its value here is 0.2024, meaning that slightly over 20% of the variation in exam points is explained by the variation in the three explanatory variables.
Multiple R-squared is known to increase when new explanatory variables are added, even if the added variable is not useful; conversely, excluding a variable will typically decrease it. Thus, it is useful to also look at the adjusted R-squared, which adjusts for the number of explanatory variables in the model. With all three variables included, the adjusted R-squared is 0.1876 (~18.8% of the variation explained).
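The adjustment can be verified by hand with the standard formula, adjusted R² = 1 − (1 − R²)(n − 1)/(n − p − 1), where n is the number of observations and p the number of explanatory variables:

```r
# Adjusted R-squared by hand: n = 166 observations, p = 3 explanatory variables
r2 <- 0.2024
n <- 166; p <- 3
1 - (1 - r2) * (n - 1) / (n - p - 1)  # ~0.1876, matching the summary
```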
#Creating a regression model with multiple explanatory variables
my_model4 <- lm(points ~ attitude + deep + surf, data = learning2014)
#Printing out a summary of the model
print(summary(my_model4))
##
## Call:
## lm(formula = points ~ attitude + deep + surf, data = learning2014)
##
## Residuals:
## Min 1Q Median 3Q Max
## -16.9168 -3.1487 0.3667 3.8326 11.3519
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 18.3551 4.7124 3.895 0.000143 ***
## attitude 3.4661 0.5766 6.011 1.18e-08 ***
## deep -0.9485 0.7903 -1.200 0.231815
## surf -1.0911 0.8360 -1.305 0.193669
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 5.313 on 162 degrees of freedom
## Multiple R-squared: 0.2024, Adjusted R-squared: 0.1876
## F-statistic: 13.7 on 3 and 162 DF, p-value: 5.217e-08
As pointed out above, deep learning has no significance when it comes to exam points. Let us examine what happens when we exclude it from the multiple regression.
#Creating a regression model with multiple explanatory variables
my_model5 <- lm(points ~ attitude + surf, data = learning2014)
#Printing out a summary of the model
print(summary(my_model5))
##
## Call:
## lm(formula = points ~ attitude + surf, data = learning2014)
##
## Residuals:
## Min 1Q Median 3Q Max
## -17.277 -3.236 0.386 3.977 10.642
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 14.1196 3.1271 4.515 1.21e-05 ***
## attitude 3.4264 0.5764 5.944 1.63e-08 ***
## surf -0.7790 0.7956 -0.979 0.329
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 5.32 on 163 degrees of freedom
## Multiple R-squared: 0.1953, Adjusted R-squared: 0.1854
## F-statistic: 19.78 on 2 and 163 DF, p-value: 2.041e-08
As expected, the value of multiple R-squared decreases, but only slightly (new values in the summary above).
There is only a small decrease in the R-squared values after excluding deep learning: slightly less of the variance in exam points is explained by attitude + surf than by all three variables.
In addition to the R-squared values, we can also look at, for example, the model p-value. With all variables included the p-value is 5.217e-08 (first summary), and after deep learning has been excluded it is 2.041e-08 (second summary), so the significance has strengthened. The model is clearly significant in both cases, though.
The small differences might be due to the fact that the surface approach on its own is not significant either. The adjusted R-squared and p-value would be slightly better with only attitude in the model (compare the first simple model above: adjusted R-squared 0.1856, p-value 4.119e-09).
Next we will draw three diagnostic plots for the multiple regression whose explanatory variables are the above-mentioned attitude and surface approach. The chosen diagnostic plots are Residuals vs Fitted, Normal QQ-plot and Residuals vs Leverage (numbers 1, 2 and 5 in the table below).
cheatsheet
| which | Graphics |
|---|---|
| 1 | Residuals vs Fitted values |
| 2 | Normal QQ-plot |
| 3 | Standardized residuals vs Fitted values |
| 4 | Cook’s distances |
| 5 | Residuals vs Leverage |
| 6 | Cook’s distance vs Leverage |
#Creating a regression model with multiple explanatory variables
my_model6 <- lm(points ~ attitude + surf, data = learning2014)
#Drawing diagnostic plots using the plot() function. Plots 1, 2 and 5 chosen
par(mfrow = c(2,2))
plot(my_model6, which = c(1,2,5))
The first diagnostic plot (Residuals vs Fitted) shows whether the residuals have non-linear patterns. Let us see how the scattered residuals distribute around the horizontal reference line. As the dots are evenly distributed along the line with no distinctive pattern, the fitted model seems appropriate. Only a couple of the observations stand out on the negative side of the red line. These extreme points have been numbered (145, 56, 35).
The second diagnostic plot (Normal Q-Q) checks whether the residuals are normally distributed. As the residuals mainly line up well with the straight dashed line, we may say that they are approximately normally distributed. However, residuals are seldom on a perfectly straight line, and here too the numbered extreme points fall notably below the dashed line. Similar behavior can be observed in the upper corner of the plot. Still, the great majority of the standardized residuals follow the dashed line moderately well.
The last diagnostic plot (Residuals vs Leverage) indicates whether there are influential observations. The leverage of an observation is based on how much its value of the explanatory variable differs from the mean of that variable. The larger the leverage, the more influence the observation can potentially have on the estimated model parameters. The red dashed line is Cook's distance, which measures the influence of an observation on the estimation of all the parameters in the model. Our case is quite typical in that there are no influential cases: all the residuals are well inside the Cook's distance lines (which are not even visible in the figure). If there were values in the upper or lower right corner, one would need to be careful, as those observations would be influential on the regression line.
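By default plot.lm() labels the most extreme points with their row names, so the flagged observations can be inspected directly (assuming the labels correspond to the row indices of learning2014, as they do for data whose row names have not been changed):

```r
# Inspect the observations flagged in the diagnostic plots
learning2014[c(35, 56, 145), ]
```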
Summary of my learning
This week’s tasks have been mainly fun and really educational. However, they have required several full days of work. The workload is not light for me, as I do not have much previous knowledge of statistical methods; there is a lot to learn. The textbook is helpful in this sense, but there is a lot of material to scan through. It is good to have real data to work with. It helps to understand what all the correlations, significance thresholds and diagnostic plots really mean. Especially the diagnostic plots and their interpretation were new to me.
Coding with R is not familiar to me, but my previous coding background with Python is somewhat helpful. DataCamp is extremely supportive, and it does pay off to do the DataCamp exercises carefully first. It takes time, but helps with producing my own code. I got the practicalities working last week, and it makes me happy that everything is still working this week. Thus, I can fully focus on the tasks and the analysis.
Regards, Janina
The purpose of this analysis is to study the relationships between high/low alcohol consumption and some of the other variables in the data. Let us take a closer look at the cleaned data, which resulted from the Data Wrangling: it was collected in 2014 in Portuguese secondary schools through a survey about alcohol consumption and students’ achievements in mathematics (math) and Portuguese language (por).
The answers of the same students in the two different classes have been combined. In addition;
date()
## [1] "Thu Dec 09 16:07:09 2021"
#Reading the data
pormath = read.table(file="pormath.txt", sep=",", header=TRUE)
#Structure and dimensions of the data
str(pormath)
## 'data.frame': 370 obs. of 51 variables:
## $ school : chr "GP" "GP" "GP" "GP" ...
## $ sex : chr "F" "F" "F" "F" ...
## $ age : int 15 15 15 15 15 15 15 15 15 15 ...
## $ address : chr "R" "R" "R" "R" ...
## $ famsize : chr "GT3" "GT3" "GT3" "GT3" ...
## $ Pstatus : chr "T" "T" "T" "T" ...
## $ Medu : int 1 1 2 2 3 3 3 2 3 3 ...
## $ Fedu : int 1 1 2 4 3 4 4 2 1 3 ...
## $ Mjob : chr "at_home" "other" "at_home" "services" ...
## $ Fjob : chr "other" "other" "other" "health" ...
## $ reason : chr "home" "reputation" "reputation" "course" ...
## $ guardian : chr "mother" "mother" "mother" "mother" ...
## $ traveltime: int 2 1 1 1 2 1 2 2 2 1 ...
## $ studytime : int 4 2 1 3 3 3 3 2 4 4 ...
## $ schoolsup : chr "yes" "yes" "yes" "yes" ...
## $ famsup : chr "yes" "yes" "yes" "yes" ...
## $ activities: chr "yes" "no" "yes" "yes" ...
## $ nursery : chr "yes" "no" "yes" "yes" ...
## $ higher : chr "yes" "yes" "yes" "yes" ...
## $ internet : chr "yes" "yes" "no" "yes" ...
## $ romantic : chr "no" "yes" "no" "no" ...
## $ famrel : int 3 3 4 4 4 4 4 4 4 4 ...
## $ freetime : int 1 3 3 3 2 3 2 1 4 3 ...
## $ goout : int 2 4 1 2 1 2 2 3 2 3 ...
## $ Dalc : int 1 2 1 1 2 1 2 1 2 1 ...
## $ Walc : int 1 4 1 1 3 1 2 3 3 1 ...
## $ health : int 1 5 2 5 3 5 5 4 3 4 ...
## $ n : int 2 2 2 2 2 2 2 2 2 2 ...
## $ id.p : int 1096 1073 1040 1025 1166 1039 1131 1069 1070 1106 ...
## $ id.m : int 2096 2073 2040 2025 2153 2039 2131 2069 2070 2106 ...
## $ failures : int 0 1 0 0 1 0 1 0 0 0 ...
## $ paid : chr "yes" "no" "no" "no" ...
## $ absences : int 3 2 8 2 5 2 0 1 9 10 ...
## $ G1 : int 10 10 14 10 12 12 11 10 16 10 ...
## $ G2 : int 12 8 13 10 12 12 6 10 16 10 ...
## $ G3 : int 12 8 12 9 12 12 6 10 16 10 ...
## $ failures.p: int 0 0 0 0 0 0 0 0 0 0 ...
## $ paid.p : chr "yes" "no" "no" "no" ...
## $ absences.p: int 4 2 8 2 2 2 0 0 6 10 ...
## $ G1.p : int 13 13 14 10 13 11 10 11 15 10 ...
## $ G2.p : int 13 11 13 11 13 12 11 10 15 10 ...
## $ G3.p : int 13 11 12 10 13 12 12 11 15 10 ...
## $ failures.m: int 1 2 0 0 2 0 2 0 0 0 ...
## $ paid.m : chr "yes" "no" "yes" "yes" ...
## $ absences.m: int 2 2 8 2 8 2 0 2 12 10 ...
## $ G1.m : int 7 8 14 10 10 12 12 8 16 10 ...
## $ G2.m : int 10 6 13 9 10 12 0 9 16 11 ...
## $ G3.m : int 10 5 13 8 10 11 0 8 16 11 ...
## $ alc_use : num 1 3 1 1 2.5 1 2 2 2.5 1 ...
## $ high_use : logi FALSE TRUE FALSE FALSE TRUE FALSE ...
## $ cid : int 3001 3002 3003 3004 3005 3006 3007 3008 3009 3010 ...
#Names of the columns
colnames(pormath)
## [1] "school" "sex" "age" "address" "famsize"
## [6] "Pstatus" "Medu" "Fedu" "Mjob" "Fjob"
## [11] "reason" "guardian" "traveltime" "studytime" "schoolsup"
## [16] "famsup" "activities" "nursery" "higher" "internet"
## [21] "romantic" "famrel" "freetime" "goout" "Dalc"
## [26] "Walc" "health" "n" "id.p" "id.m"
## [31] "failures" "paid" "absences" "G1" "G2"
## [36] "G3" "failures.p" "paid.p" "absences.p" "G1.p"
## [41] "G2.p" "G3.p" "failures.m" "paid.m" "absences.m"
## [46] "G1.m" "G2.m" "G3.m" "alc_use" "high_use"
## [51] "cid"
The structure of the data looks good. There are 370 observations of 51 variables (see colnames above), such as gender, age, information about the family’s education, study time, alcohol consumption etc. The entire attribute information can be found here.
library(tidyr); library(dplyr); library(ggplot2)
##
## Attaching package: 'dplyr'
## The following objects are masked from 'package:stats':
##
## filter, lag
## The following objects are masked from 'package:base':
##
## intersect, setdiff, setequal, union
#Drawing a bar plot of each variable
gather(pormath) %>% ggplot(aes(value)) + facet_wrap("key", scales = "free") + geom_bar()
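Note that gather() still works but has been superseded in newer versions of tidyr; an equivalent sketch with its replacement, pivot_longer(), would be:

```r
library(tidyr); library(dplyr); library(ggplot2)

# pivot_longer() supersedes gather(); mixed column types are coerced to character
pormath %>%
  pivot_longer(everything(), values_transform = list(value = as.character)) %>%
  ggplot(aes(value)) + facet_wrap("name", scales = "free") + geom_bar()
```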
To examine the relationships with alcohol consumption, some variables need to be chosen. In this report we have chosen age, absences, going out with friends (goout) and the final grade (G3) to be compared with alcohol consumption (alc_use) and whether it is high or not (high_use).
My personal hypotheses
Let us begin with a numerical overview of all four chosen variables against high/low alcohol consumption: high_use ~ age, goout, absences and final grade (G3):
#Producing summary statistics by group from the chosen variables
#Gender, high/low alcohol consumption, number of observations, mean age
pormath %>% group_by(sex, high_use) %>% summarise(count = n(), mean_age = mean(age))
## `summarise()` has grouped output by 'sex'. You can override using the `.groups` argument.
## # A tibble: 4 x 4
## # Groups: sex [2]
## sex high_use count mean_age
## <chr> <lgl> <int> <dbl>
## 1 F FALSE 154 16.6
## 2 F TRUE 41 16.5
## 3 M FALSE 105 16.3
## 4 M TRUE 70 16.9
The mean age is approximately equal for all females and for males who do not use much alcohol; only the mean age of males with high alcohol consumption is slightly higher. Females with high alcohol consumption (41, vs. 154 with high_use FALSE) form the smallest of these four groups (females and males with high/low alcohol consumption). The male students with high consumption (high_use) are also a minority (70, vs. 105 with high_use FALSE).
#Gender, high/low alcohol consumption, number of observations, mean of going out
pormath %>% group_by(sex, high_use) %>% summarise(count = n(), mean_go_out = mean(goout))
## `summarise()` has grouped output by 'sex'. You can override using the `.groups` argument.
## # A tibble: 4 x 4
## # Groups: sex [2]
## sex high_use count mean_go_out
## <chr> <lgl> <int> <dbl>
## 1 F FALSE 154 2.95
## 2 F TRUE 41 3.39
## 3 M FALSE 105 2.70
## 4 M TRUE 70 3.93
The mean of going out with friends seems to be higher for those with high alcohol consumption, regardless of gender. In addition, females with low alcohol consumption go out slightly more often than males with low alcohol consumption.
#Gender, high/low alcohol consumption, number of observations, mean of absences
pormath %>% group_by(sex, high_use) %>% summarise(count = n(), mean_absences = mean(absences))
## `summarise()` has grouped output by 'sex'. You can override using the `.groups` argument.
## # A tibble: 4 x 4
## # Groups: sex [2]
## sex high_use count mean_absences
## <chr> <lgl> <int> <dbl>
## 1 F FALSE 154 4.25
## 2 F TRUE 41 6.85
## 3 M FALSE 105 2.91
## 4 M TRUE 70 6.1
Absences show clear differences related to high/low alcohol consumption. For both females and males with high alcohol consumption, the mean number of absences is over 6, compared to 4.25 for females and 2.91 for males with low consumption. Females with high alcohol consumption also have more absences than males with high consumption (6.85 vs. 6.1).
#Gender, high/low alcohol consumption, number of observations, mean grade
pormath %>% group_by(sex, high_use) %>% summarise(count = n(), mean_grade = mean(G3))
## `summarise()` has grouped output by 'sex'. You can override using the `.groups` argument.
## # A tibble: 4 x 4
## # Groups: sex [2]
## sex high_use count mean_grade
## <chr> <lgl> <int> <dbl>
## 1 F FALSE 154 11.4
## 2 F TRUE 41 11.8
## 3 M FALSE 105 12.3
## 4 M TRUE 70 10.3
The final grades seem very similar for all females, regardless of their alcohol use; there is even a slight increase in grades among females using alcohol more frequently (11.8 vs. 11.4). For males a clear difference can be observed: males not using much alcohol have considerably higher grades (12.3) than the ones using alcohol more often (10.3).
Let us continue by drawing bar plots of all four chosen variables and alcohol consumption; A age, B goout, C absences, D final grade (G3), E alc_use and F high_use:
library(tidyr); library(dplyr); library(ggplot2); library(cowplot)
#Drawing a bar plot of each variable (alc_use, high_use and 4 chosen variables)
#Initializing a barplot of age
g1 <- ggplot(pormath, aes(x = age, fill = sex)) + geom_bar() #+ ylab("age")
#... of going out with friends (goout)
g2 <- ggplot(pormath, aes(x = goout, fill = sex)) + geom_bar()
#... of absences
g3 <- ggplot(pormath, aes(x = absences, fill = sex)) + geom_bar()
#... of final grade (G3)
g4 <- ggplot(pormath, aes(x = G3, fill = sex)) + geom_bar()
#... of alcohol consumption (alc_use)
g5 <- ggplot(pormath, aes(x = alc_use, fill = sex)) + geom_bar()
#... of whether the consumption is high/low (high_use)
g6 <- ggplot(pormath, aes(x = high_use, fill = sex)) + geom_bar()
#Plotting all 6 bar plots in a grid
plot_grid(g1, g2, g3, g4, g5, g6, labels="AUTO")
All the bar plots look reasonable. There seem to be more female respondents, whose ages vary mainly between 15 and 18 years. Males seem to go out with friends just slightly more often than females (goout, panel B), and there are more males with high alcohol consumption.
Let us now examine the relationships and correlation between chosen 4 variables and alcohol consumption:
library(GGally)
library(ggplot2)
library(dplyr)
#Choosing a handful of columns to keep
keep_columns <- c("sex", "age", "goout", "absences", "G3", "alc_use", "high_use")
#Selecting the 'keep_columns' to create a new dataset
data <- select(pormath, one_of(keep_columns))
#Creating a plot matrix with ggpairs(), colouring has been set by gender (M=blue, F=red)
p <- ggpairs(data, mapping = aes(col=sex, alpha=0.3), lower = list(combo = wrap("facethist", bins = 20)))
#Drawing the plot
p
There are a few more female respondents than male. The distributions shown along the diagonal of the plot matrix are fairly similar between females and males, excluding age and alcohol consumption: the age distribution has higher peaks for females, and high alcohol usage is slightly more common among males.
There seems to be some correlation between alcohol consumption and all the chosen variables: age (0.16), goout (0.389), absences (0.214) and G3, the final grade (-0.156). Only the correlation between final grade and alcohol consumption is negative. These relations follow the hypotheses presented above. However, there are notable differences between the two genders: for females, age and final grade do not correlate with alcohol consumption, whereas for male respondents all the chosen variables do.
According to the correlations, the hypotheses hold up fairly well. I did not distinguish between female and male students, which brought a couple of surprises: the lack of correlation of age and final grade with alcohol usage among females was not expected. When the genders are combined, the observations follow the above-mentioned hypotheses.
Let us still look at box plots. We will compare the chosen variables with whether the student has a high alcohol consumption or not; A age, B goout, C absences, D final grade (G3).
library(cowplot)
#Let us do boxplots out of 4 variable sets
#Initializing a boxplot of high_use and age
g11 <- ggplot(pormath, aes(x = high_use, y = age, col = sex)) + geom_boxplot() + ylab("age")
#... of high_use and going out with friends (goout)
g22 <- ggplot(pormath, aes(x = high_use, y = goout, col = sex)) + geom_boxplot() + ylab("going out")
#... of high_use and absences
g33 <- ggplot(pormath, aes(x = high_use, y = absences, col = sex)) + geom_boxplot() + ylab("absences")
#... of high_use and final grade (G3)
g44 <- ggplot(pormath, aes(x = high_use, y = G3, col = sex)) + geom_boxplot() + ylab("grade")
#Plotting all 4 boxplots
plot_grid(g11, g22, g33, g44, labels = "AUTO")
In a box plot, the thicker line in the middle of the box shows the median of the answers. The edges of the box show the lower and upper quartiles of the observations. The vertical lines outside the box are called ‘whiskers’ and extend to the most extreme observations within 1.5 times the interquartile range from the box; the dots represent outliers beyond the whiskers.
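The numbers behind a box plot (minimum, lower quartile, median, upper quartile, maximum) can also be checked numerically with base R, for example for absences:

```r
# Five-number summary of absences: min, lower hinge, median, upper hinge, max
fivenum(pormath$absences)
# Quartiles directly
quantile(pormath$absences, probs = c(0.25, 0.5, 0.75))
```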
High consumption of alcohol is slightly more common among older males (~17 years); otherwise age does not have a clear influence according to the box plots. Instead, the relation between going out with friends and high alcohol usage is clearly visible: the median of goout for both females and males with high consumption is approximately 4, whereas the students with low alcohol consumption have a median goout of 3.
Absences are also more common when alcohol usage is high (positive correlation). The differences may seem small in the box plot, but only because the maximum values of absences are far from the median (even exceeding 40). The numerical tables and correlations examined above reveal the pattern among students who miss classes fewer than about 20 times. Looking at the final grades, the negative correlation is visible for males: their grades are higher when alcohol consumption is low. For females no clear differences are perceptible.
Let us fit a logistic regression model to statistically explore the relationship between the four variables chosen above and the binary high/low usage of alcohol (target variable).
#Finding the model with glm()
m <- glm(high_use ~ age + goout + absences + G3, data = pormath, family = "binomial")
#Printing out a summary of the model
summary(m)
##
## Call:
## glm(formula = high_use ~ age + goout + absences + G3, family = "binomial",
## data = pormath)
##
## Deviance Residuals:
## Min 1Q Median 3Q Max
## -1.8967 -0.7525 -0.5409 0.9467 2.2946
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) -3.77834 1.96186 -1.926 0.05412 .
## age 0.04563 0.11022 0.414 0.67890
## goout 0.70668 0.11917 5.930 3.03e-09 ***
## absences 0.07315 0.02237 3.270 0.00108 **
## G3 -0.04472 0.03923 -1.140 0.25435
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for binomial family taken to be 1)
##
## Null deviance: 452.04 on 369 degrees of freedom
## Residual deviance: 388.67 on 365 degrees of freedom
## AIC: 398.67
##
## Number of Fisher Scoring iterations: 4
#Printing out the coefficients of the model
coef(m)
## (Intercept) age goout absences G3
## -3.77834234 0.04562870 0.70667982 0.07314823 -0.04471930
We will start from the p-values (Pr(>|z|)) of the variables. The z- and p-values in the regression table tell whether each coefficient differs significantly from zero. The variables goout and absences stand out as clearly significant in explaining the target variable, whereas age and final grades (G3) are clearly above the usual significance levels. (For a categorical predictor, however, a single non-significant class would not by itself imply that the whole variable is meaningless and should be removed from the model.)
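As a side note, the p-value column can also be extracted programmatically from the model summary instead of read off by eye. A minimal sketch, using the built-in mtcars data as a hypothetical stand-in for our model `m` and the `pormath` data:

```r
# Hypothetical example: pulling the Pr(>|z|) column out of a glm summary
m_toy <- glm(am ~ wt + hp, data = mtcars, family = "binomial")
pvals <- summary(m_toy)$coefficients[, "Pr(>|z|)"]
names(pvals[pvals < 0.05])  # the terms significant at the 5 % level
```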
Next we will look more closely at the coefficients of the model as odds ratios, and also provide confidence intervals for them.
#Computing odds ratios (OR)
OR <- coef(m) %>% exp
#Computing confidence intervals (CI)
CI <- confint(m) %>% exp
## Waiting for profiling to be done...
#Printing out the odds ratios with their confidence intervals
cbind(OR, CI)
## OR 2.5 % 97.5 %
## (Intercept) 0.02286056 0.0004637374 1.033922
## age 1.04668570 0.8430071337 1.299954
## goout 2.02724925 1.6139414455 2.577757
## absences 1.07589001 1.0312932049 1.127070
## G3 0.95626587 0.8852281993 1.032878
The odds ratio quantifies the relationship between a chosen variable and the target variable. An odds ratio higher than 1 means the variable is associated with the target being TRUE (the student has high alcohol consumption). The odds ratio is highest for goout (~2.03), and it also exceeds one for age and absences. Final grade (G3) is the only variable whose odds ratio remains under 1 (~0.96).
Let us look more closely at the confidence intervals. If a confidence interval contains the value one, we can conclude that there is no evidence of an association between the variable and the target. In our case this applies to age (0.84 - 1.30) and final grade (0.89 - 1.03). Thus, there is no evidence of association between high_use and age or final grade. Instead, the confidence intervals of both goout and absences lie entirely above one, so they do have an association with high alcohol consumption.
These results lead to a new conclusion. Even though a small correlation with high_use was visible for age and final grades, according to the odds ratios and confidence intervals they should no longer be included in the model: their association is too weak, possibly coincidental. As predicted in the hypothesis, going out with friends and absences from lessons do associate with high alcohol consumption. Thus, let us keep these two variables (goout, absences) and move on to explore the predictive power of our model.
Using the variables which, according to the logistic regression model above, had a statistical relationship with high/low alcohol consumption (goout, absences), we will now explore the predictive power of our model.
#Fitting the model
m <- glm(high_use ~ goout + absences, data = pormath, family = "binomial")
#Predicting - predict() - the probability of high_use
probabilities <- predict(m, type = "response")
#Adding the predicted probabilities to 'alc'
alc <- mutate(pormath, probability = probabilities)
#Using the probabilities to make a prediction of high_use
alc <- mutate(alc, prediction = probability > 0.5)
#Let us see the last ten original classes, predicted probabilities, and class predictions
select(alc, goout, absences, sex, high_use, probability, prediction) %>% tail(10)
## goout absences sex high_use probability prediction
## 361 3 7 M TRUE 0.28963176 FALSE
## 362 3 3 M TRUE 0.23104046 FALSE
## 363 1 2 M TRUE 0.06029226 FALSE
## 364 4 4 M TRUE 0.40315734 FALSE
## 365 2 3 M FALSE 0.12606082 FALSE
## 366 3 4 M TRUE 0.24487648 FALSE
## 367 2 0 M FALSE 0.10291931 FALSE
## 368 4 4 M TRUE 0.40315734 FALSE
## 369 4 8 M TRUE 0.47825016 FALSE
## 370 2 0 M FALSE 0.10291931 FALSE
Above we now have a table for comparing the real values with the predicted probabilities and class predictions. When the probability (of using alcohol more than 2 times a week) exceeds 50 % (0.5), the prediction is TRUE. The table shows the last 10 observations of our data, and the predictions do not look too good: 7 out of the 10 observations are in reality categorized as high_use = TRUE, but the predictions are all FALSE.
Let us tabulate the target variable (high_use) versus the predictions to see how well the predictions actually work.
#Let us tabulate the target variable versus the predictions
table(high_use = alc$high_use, prediction = alc$prediction)
## prediction
## high_use FALSE TRUE
## FALSE 236 23
## TRUE 65 46
The cases where high_use and the prediction are both FALSE or both TRUE have been predicted successfully. The contradicting cells, TRUE with FALSE or FALSE with TRUE (the upper right and lower left values), are the wrong predictions. As most of the students do not use a lot of alcohol, it is reasonable that the FALSE/FALSE count is clearly the highest. To see the shares of the other cases better, let us switch to proportions.
#Tabulating the target variable versus the predictions as percentages
table(high_use = alc$high_use, prediction = alc$prediction) %>% prop.table() %>% addmargins()
## prediction
## high_use FALSE TRUE Sum
## FALSE 0.63783784 0.06216216 0.70000000
## TRUE 0.17567568 0.12432432 0.30000000
## Sum 0.81351351 0.18648649 1.00000000
Above we see that the model has correctly classified ~64 % of all students as low users and ~12 % as high users, so ~76 % of the observations are predicted correctly.
The share of inaccurately classified individuals (mispredicted observations), i.e. the training error, is calculated using the loss function defined below. In practice the function computes the share of wrong predictions in the data. The training error is ~0.24 (0.06 + 0.18 ≈ 0.24).
#Defining a loss function (mean prediction error)
#Class is here high_use
loss_func <- function(class, prob) {
n_wrong <- abs(class - prob) > 0.5
mean(n_wrong)
}
#Calling loss_func to compute the average number of wrong predictions in the data
loss_func(class = alc$high_use, prob = alc$probability)
## [1] 0.2378378
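As a sanity check, the loss function above is (apart from ties at a probability of exactly 0.5) just the misclassification rate of a 0.5-threshold classifier. A small self-contained sketch with made-up values:

```r
# The same loss function as above
loss_func <- function(class, prob) {
  n_wrong <- abs(class - prob) > 0.5
  mean(n_wrong)
}
class <- c(TRUE, TRUE, FALSE, FALSE, TRUE)
prob  <- c(0.9, 0.3, 0.2, 0.6, 0.7)
loss_func(class, prob)        # 2 wrong predictions out of 5 = 0.4
mean(class != (prob > 0.5))   # the same misclassification rate
```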
Let us perform a 10-fold cross-validation on our model.
#Reusing the loss function (average prediction error) defined above
#K-fold cross-validation
library(boot)
cv <- cv.glm(data = alc, cost = loss_func, glmfit = m, K = 10)
#Average number of wrong predictions in the cross validation
cv$delta[1]
## [1] 0.2432432
The training error of our model is 0.2378 ≈ 0.24. This is slightly lower than the 10-fold cross-validation estimate, which is expected: the training error is an optimistic measure, whereas cross-validation approximates test set performance. The prediction error from 10-fold cross-validation varies somewhat every time it is run (between about 24 % and 26 %), because the folds are assigned randomly.
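The run-to-run variation could be removed by fixing the random fold assignment with set.seed() before calling cv.glm(). A sketch with toy data standing in for 'alc' and our fitted model:

```r
library(boot)

# Toy data and model standing in for 'alc' and 'm'
set.seed(2021)
d <- data.frame(y = rbinom(200, 1, 0.3), x = rnorm(200))
fit <- glm(y ~ x, data = d, family = "binomial")
loss_func <- function(class, prob) mean(abs(class - prob) > 0.5)

# Same seed before each call -> same folds -> same error estimate
set.seed(1)
cv1 <- cv.glm(data = d, cost = loss_func, glmfit = fit, K = 10)$delta[1]
set.seed(1)
cv2 <- cv.glm(data = d, cost = loss_func, glmfit = fit, K = 10)$delta[1]
cv1 == cv2  # TRUE
```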
Summary of my learning
This week’s tasks were more challenging than last week’s. Even now not everything is clear, but I think I managed to complete all the necessary tasks plus one bonus task. New plots were used and learned. The prediction model, the odds ratios and the 10-fold cross-validation were new and interesting ways to check whether the variables really are associated with the target and how well it can be predicted.
Regards, Janina
In this analysis we will be using a data set called “Boston”, which can be found in the “MASS” package (source). Let us take a closer look at the data.
date()
## [1] "Thu Dec 09 16:07:23 2021"
#Accessing the MASS package
library(MASS)
##
## Attaching package: 'MASS'
## The following object is masked from 'package:dplyr':
##
## select
#Loading the data
data("Boston")
#Exploring the dataset
str(Boston)
## 'data.frame': 506 obs. of 14 variables:
## $ crim : num 0.00632 0.02731 0.02729 0.03237 0.06905 ...
## $ zn : num 18 0 0 0 0 0 12.5 12.5 12.5 12.5 ...
## $ indus : num 2.31 7.07 7.07 2.18 2.18 2.18 7.87 7.87 7.87 7.87 ...
## $ chas : int 0 0 0 0 0 0 0 0 0 0 ...
## $ nox : num 0.538 0.469 0.469 0.458 0.458 0.458 0.524 0.524 0.524 0.524 ...
## $ rm : num 6.58 6.42 7.18 7 7.15 ...
## $ age : num 65.2 78.9 61.1 45.8 54.2 58.7 66.6 96.1 100 85.9 ...
## $ dis : num 4.09 4.97 4.97 6.06 6.06 ...
## $ rad : int 1 2 2 3 3 3 5 5 5 5 ...
## $ tax : num 296 242 242 222 222 222 311 311 311 311 ...
## $ ptratio: num 15.3 17.8 17.8 18.7 18.7 18.7 15.2 15.2 15.2 15.2 ...
## $ black : num 397 397 393 395 397 ...
## $ lstat : num 4.98 9.14 4.03 2.94 5.33 ...
## $ medv : num 24 21.6 34.7 33.4 36.2 28.7 22.9 27.1 16.5 18.9 ...
summary(Boston)
## crim zn indus chas
## Min. : 0.00632 Min. : 0.00 Min. : 0.46 Min. :0.00000
## 1st Qu.: 0.08205 1st Qu.: 0.00 1st Qu.: 5.19 1st Qu.:0.00000
## Median : 0.25651 Median : 0.00 Median : 9.69 Median :0.00000
## Mean : 3.61352 Mean : 11.36 Mean :11.14 Mean :0.06917
## 3rd Qu.: 3.67708 3rd Qu.: 12.50 3rd Qu.:18.10 3rd Qu.:0.00000
## Max. :88.97620 Max. :100.00 Max. :27.74 Max. :1.00000
## nox rm age dis
## Min. :0.3850 Min. :3.561 Min. : 2.90 Min. : 1.130
## 1st Qu.:0.4490 1st Qu.:5.886 1st Qu.: 45.02 1st Qu.: 2.100
## Median :0.5380 Median :6.208 Median : 77.50 Median : 3.207
## Mean :0.5547 Mean :6.285 Mean : 68.57 Mean : 3.795
## 3rd Qu.:0.6240 3rd Qu.:6.623 3rd Qu.: 94.08 3rd Qu.: 5.188
## Max. :0.8710 Max. :8.780 Max. :100.00 Max. :12.127
## rad tax ptratio black
## Min. : 1.000 Min. :187.0 Min. :12.60 Min. : 0.32
## 1st Qu.: 4.000 1st Qu.:279.0 1st Qu.:17.40 1st Qu.:375.38
## Median : 5.000 Median :330.0 Median :19.05 Median :391.44
## Mean : 9.549 Mean :408.2 Mean :18.46 Mean :356.67
## 3rd Qu.:24.000 3rd Qu.:666.0 3rd Qu.:20.20 3rd Qu.:396.23
## Max. :24.000 Max. :711.0 Max. :22.00 Max. :396.90
## lstat medv
## Min. : 1.73 Min. : 5.00
## 1st Qu.: 6.95 1st Qu.:17.02
## Median :11.36 Median :21.20
## Mean :12.65 Mean :22.53
## 3rd Qu.:16.95 3rd Qu.:25.00
## Max. :37.97 Max. :50.00
The Boston data frame has 506 rows and 14 columns; the variables are listed in the structure output above.
Let us next see the plot matrix of all the variables.
#Plotting matrix of the variables
pairs(Boston)
And then the correlation matrix. The matrix is an easy way to examine the correlations between all variable pairs.
library(tidyr)
library(corrplot)
## Warning: package 'corrplot' was built under R version 4.1.2
## corrplot 0.92 loaded
#Calculating the correlation matrix and round it
cor_matrix<-cor(Boston) %>% round(digits = 2)
#Printing the correlation matrix
cor_matrix
## crim zn indus chas nox rm age dis rad tax ptratio
## crim 1.00 -0.20 0.41 -0.06 0.42 -0.22 0.35 -0.38 0.63 0.58 0.29
## zn -0.20 1.00 -0.53 -0.04 -0.52 0.31 -0.57 0.66 -0.31 -0.31 -0.39
## indus 0.41 -0.53 1.00 0.06 0.76 -0.39 0.64 -0.71 0.60 0.72 0.38
## chas -0.06 -0.04 0.06 1.00 0.09 0.09 0.09 -0.10 -0.01 -0.04 -0.12
## nox 0.42 -0.52 0.76 0.09 1.00 -0.30 0.73 -0.77 0.61 0.67 0.19
## rm -0.22 0.31 -0.39 0.09 -0.30 1.00 -0.24 0.21 -0.21 -0.29 -0.36
## age 0.35 -0.57 0.64 0.09 0.73 -0.24 1.00 -0.75 0.46 0.51 0.26
## dis -0.38 0.66 -0.71 -0.10 -0.77 0.21 -0.75 1.00 -0.49 -0.53 -0.23
## rad 0.63 -0.31 0.60 -0.01 0.61 -0.21 0.46 -0.49 1.00 0.91 0.46
## tax 0.58 -0.31 0.72 -0.04 0.67 -0.29 0.51 -0.53 0.91 1.00 0.46
## ptratio 0.29 -0.39 0.38 -0.12 0.19 -0.36 0.26 -0.23 0.46 0.46 1.00
## black -0.39 0.18 -0.36 0.05 -0.38 0.13 -0.27 0.29 -0.44 -0.44 -0.18
## lstat 0.46 -0.41 0.60 -0.05 0.59 -0.61 0.60 -0.50 0.49 0.54 0.37
## medv -0.39 0.36 -0.48 0.18 -0.43 0.70 -0.38 0.25 -0.38 -0.47 -0.51
## black lstat medv
## crim -0.39 0.46 -0.39
## zn 0.18 -0.41 0.36
## indus -0.36 0.60 -0.48
## chas 0.05 -0.05 0.18
## nox -0.38 0.59 -0.43
## rm 0.13 -0.61 0.70
## age -0.27 0.60 -0.38
## dis 0.29 -0.50 0.25
## rad -0.44 0.49 -0.38
## tax -0.44 0.54 -0.47
## ptratio -0.18 0.37 -0.51
## black 1.00 -0.37 0.33
## lstat -0.37 1.00 -0.74
## medv 0.33 -0.74 1.00
#Visualizing the correlation matrix
corrplot(cor_matrix, method="circle", type="upper", cl.pos="b", tl.pos="d", tl.cex = 0.6)
Correlations of 0.6 or above are between: crim - rad (0.63), zn - dis (0.66), indus - nox (0.76), indus - age (0.64), indus - rad (0.60), indus - tax (0.72), indus - lstat (0.60), nox - age (0.73), nox - rad (0.61), nox - tax (0.67), age - lstat (0.60), rm - medv (0.70) and rad - tax (0.91).
Correlations of -0.6 or below are between: indus - dis (-0.71), nox - dis (-0.77), age - dis (-0.75), rm - lstat (-0.61) and lstat - medv (-0.74).
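The strongly correlated pairs can also be picked out programmatically instead of by eye. A sketch on the Boston correlation matrix (note that the strict > 0.6 cut is applied here to the unrounded correlations, so pairs that only round up to 0.60 may drop out):

```r
library(MASS)

cm <- cor(Boston)
cm[lower.tri(cm, diag = TRUE)] <- NA   # keep each pair only once
strong <- which(abs(cm) > 0.6, arr.ind = TRUE)
pairs_df <- data.frame(var1 = rownames(cm)[strong[, 1]],
                       var2 = colnames(cm)[strong[, 2]],
                       r    = round(cm[strong], 2))
pairs_df[order(-abs(pairs_df$r)), ]    # rad - tax (0.91) tops the list
```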
For later use, we need to scale our data. In scaling, the column mean is subtracted from each column and the difference is then divided by the column’s standard deviation.
As the “Boston” data set contains only numerical values, the whole data set can be standardized at once using the function scale().
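As a quick sanity check of what scale() computes, its result can be compared against the manual formula (x - mean(x)) / sd(x) on a tiny vector:

```r
# scale() with defaults centers by the mean and divides by the standard deviation
x <- c(2, 4, 6, 8)
manual <- (x - mean(x)) / sd(x)
all.equal(as.vector(scale(x)), manual)  # TRUE
```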
#Let us center and standardize variables
boston_scaled <- scale(Boston)
#Summaries of the scaled variables
summary(boston_scaled)
## crim zn indus chas
## Min. :-0.419367 Min. :-0.48724 Min. :-1.5563 Min. :-0.2723
## 1st Qu.:-0.410563 1st Qu.:-0.48724 1st Qu.:-0.8668 1st Qu.:-0.2723
## Median :-0.390280 Median :-0.48724 Median :-0.2109 Median :-0.2723
## Mean : 0.000000 Mean : 0.00000 Mean : 0.0000 Mean : 0.0000
## 3rd Qu.: 0.007389 3rd Qu.: 0.04872 3rd Qu.: 1.0150 3rd Qu.:-0.2723
## Max. : 9.924110 Max. : 3.80047 Max. : 2.4202 Max. : 3.6648
## nox rm age dis
## Min. :-1.4644 Min. :-3.8764 Min. :-2.3331 Min. :-1.2658
## 1st Qu.:-0.9121 1st Qu.:-0.5681 1st Qu.:-0.8366 1st Qu.:-0.8049
## Median :-0.1441 Median :-0.1084 Median : 0.3171 Median :-0.2790
## Mean : 0.0000 Mean : 0.0000 Mean : 0.0000 Mean : 0.0000
## 3rd Qu.: 0.5981 3rd Qu.: 0.4823 3rd Qu.: 0.9059 3rd Qu.: 0.6617
## Max. : 2.7296 Max. : 3.5515 Max. : 1.1164 Max. : 3.9566
## rad tax ptratio black
## Min. :-0.9819 Min. :-1.3127 Min. :-2.7047 Min. :-3.9033
## 1st Qu.:-0.6373 1st Qu.:-0.7668 1st Qu.:-0.4876 1st Qu.: 0.2049
## Median :-0.5225 Median :-0.4642 Median : 0.2746 Median : 0.3808
## Mean : 0.0000 Mean : 0.0000 Mean : 0.0000 Mean : 0.0000
## 3rd Qu.: 1.6596 3rd Qu.: 1.5294 3rd Qu.: 0.8058 3rd Qu.: 0.4332
## Max. : 1.6596 Max. : 1.7964 Max. : 1.6372 Max. : 0.4406
## lstat medv
## Min. :-1.5296 Min. :-1.9063
## 1st Qu.:-0.7986 1st Qu.:-0.5989
## Median :-0.1811 Median :-0.1449
## Mean : 0.0000 Mean : 0.0000
## 3rd Qu.: 0.6024 3rd Qu.: 0.2683
## Max. : 3.5453 Max. : 2.9865
#The class of the boston_scaled object
class(boston_scaled)
## [1] "matrix" "array"
Standardizing the data sets the mean of every variable to 0 (and the standard deviation to 1), which makes the variables comparable with each other.
The class of the scaled Boston object is a matrix (array), not a data frame.
Later on, we will want the data to be a data frame. Thereby, we will next convert the scaled Boston-data set to a data frame format.
#Changing the object to data frame
boston_scaled <- as.data.frame(boston_scaled)
We will be focusing on the crime rate in the Boston data set. To use and analyse the variable crim we need to create a categorical variable from the continuous one. Thus, let us turn crim into crime (rate per capita by town): we cut the variable at its quantiles (used as break points) so that low, middle and high rates of crime fall into their own categories.
#Creating a quantile vector of 'crim' and printing it
bins <- quantile(boston_scaled$crim)
bins
## 0% 25% 50% 75% 100%
## -0.419366929 -0.410563278 -0.390280295 0.007389247 9.924109610
#Creating a categorical variable 'crime' according to the quantiles set in bins
crime <- cut(boston_scaled$crim, breaks = bins, include.lowest = TRUE, labels = c("low","med_low","med_high","high"))
#Let us look at the table of the new factor crime
table(crime)
## crime
## low med_low med_high high
## 127 126 126 127
#Dropping original variable 'crim' from the dataset
boston_scaled <- dplyr::select(boston_scaled, -crim)
#Replacing removed 'crim' with the new categorical value 'crime' to scaled data
boston_scaled <- data.frame(boston_scaled, crime)
The categorized crime data consists of categories as even as possible: low (127), med_low (126), med_high (126) and high (127) observations.
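A toy version of the quantile-based binning used above shows why the categories come out (almost) equal in size: with continuous data and quartile break points, each bin receives a quarter of the observations.

```r
# Quartiles as break points -> four equally sized categories
set.seed(42)
x <- rnorm(100)
bins <- quantile(x)
cats <- cut(x, breaks = bins, include.lowest = TRUE,
            labels = c("low", "med_low", "med_high", "high"))
table(cats)  # 25 observations in each category
```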
Next we will divide our data set to ‘train’ and ‘test’ sets. 80% of the data will go to the ‘train’ set.
#The number of rows in the Boston data set
n <- nrow(boston_scaled)
#Choosing randomly 80% of the rows
ind <- sample(n, size = n * 0.8)
#Creating 'train' set (80% of the rows)
train <- boston_scaled[ind,]
#Creating 'test' set (20% of the rows)
test <- boston_scaled[-ind,]
#Let us save the correct classes from 'test' data
correct_classes <- test$crime
#And remove the crime variable from 'test' data
test <- dplyr::select(test, -crime)
Now we will apply linear discriminant analysis to the train set. LDA is a classification (and dimension reduction) method used to find a linear combination of features that characterizes two or more classes of objects. The resulting combinations may be used as a linear classifier or for dimensionality reduction before later classification.
In the LDA below the categorical crime rate is used as the target variable and all the other variables in the data set act as predictor variables. Then there is a LDA biplot.
#A linear discriminant analysis (LDA)
#Crime rate as target variable, all the others (.) as predictor variables
lda.fit <- lda(crime ~ ., data = train)
#Printing the lda.fit object
lda.fit
## Call:
## lda(crime ~ ., data = train)
##
## Prior probabilities of groups:
## low med_low med_high high
## 0.2326733 0.2772277 0.2500000 0.2400990
##
## Group means:
## zn indus chas nox rm age
## low 0.9744596 -0.9098180 -0.104792887 -0.8876322 0.4853945 -0.8680449
## med_low -0.1269957 -0.2896550 0.008892378 -0.5600847 -0.1810965 -0.2921296
## med_high -0.3751653 0.2003698 0.195445218 0.4402730 0.1093779 0.4326837
## high -0.4872402 1.0149946 -0.069385757 1.0423878 -0.4531883 0.8039125
## dis rad tax ptratio black lstat
## low 0.8786761 -0.7020851 -0.7307117 -0.4875567 0.38132823 -0.77452569
## med_low 0.3159263 -0.5450436 -0.4769806 -0.0318375 0.31876387 -0.07921127
## med_high -0.3916494 -0.4087748 -0.2886782 -0.3608757 0.04694643 -0.02024186
## high -0.8390339 1.6596029 1.5294129 0.8057784 -0.88156866 0.86153824
## medv
## low 0.57235431
## med_low -0.03773927
## med_high 0.19935953
## high -0.69192804
##
## Coefficients of linear discriminants:
## LD1 LD2 LD3
## zn 0.12173157 0.660094102 -0.91420474
## indus 0.04302670 -0.227110531 0.22980204
## chas -0.06548305 -0.054965234 0.15016027
## nox 0.21476876 -0.928161508 -1.28178102
## rm -0.12378505 -0.051310938 -0.19196261
## age 0.27334440 -0.305334088 -0.09071289
## dis -0.10659761 -0.310389387 0.07458488
## rad 3.48633098 0.899749430 -0.02330908
## tax -0.01886255 0.007582303 0.35954274
## ptratio 0.12821085 0.025139492 -0.22688665
## black -0.15444489 0.013821395 0.12573932
## lstat 0.13299143 -0.119509268 0.40345910
## medv 0.15883296 -0.360933976 -0.23346860
##
## Proportion of trace:
## LD1 LD2 LD3
## 0.9501 0.0357 0.0142
#The function for lda biplot arrows
lda.arrows <- function(x, myscale = 1, arrow_heads = 0.1, color = "red", tex = 0.75, choices = c(1,2)){
heads <- coef(x)
arrows(x0 = 0, y0 = 0,
x1 = myscale * heads[,choices[1]],
y1 = myscale * heads[,choices[2]], col=color, length = arrow_heads)
text(myscale * heads[,choices], labels = row.names(heads),
cex = tex, col=color, pos=3)
}
#Target classes as numeric
classes <- as.numeric(train$crime)
#Plotting the results of lda
plot(lda.fit, dimen = 2, col = classes, pch = classes)
lda.arrows(lda.fit, myscale = 2)
In the graph above we can see that only LD1 separates the classes nicely; LD2 does not add much valuable information to the separation. The arrow called ‘rad’ is the longest and points towards the ‘high’ class. The arrows of the variables ‘zn’ and ‘nox’ also differ from the others: ‘zn’ points upward and ‘nox’ downward.
Let us try to predict the values based on our model. The data was split earlier on in order to have a test set and the correct class labels. Now we will test how the LDA model performs when predicting the classes on test data.
Below, there is a cross tabulation of the results with the crime categories from the test set.
#Prediction of the classes with test data
lda.pred <- predict(lda.fit, newdata = test)
#Cross tabulation of the results
table(correct = correct_classes, predicted = lda.pred$class)
## predicted
## correct low med_low med_high high
## low 19 14 0 0
## med_low 4 7 3 0
## med_high 1 9 14 1
## high 0 0 1 29
The cross tabulation gives a table where the correct classes are the rows and the predictions the columns. It is good to see that most of the observations have been predicted correctly (the numbers on the diagonal). However, there are several cells with wrong predictions; the highest number of mistakes falls under ‘med_low’, as the model has predicted several low and med_high cases as med_low. The exact numbers change on each run due to the randomness of the train/test split.
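From such a cross tabulation, the overall accuracy is the sum of the diagonal divided by the table total. A self-contained sketch of the same split/fit/predict/tabulate pattern, using the built-in iris data as a stand-in for our Boston split:

```r
library(MASS)

# 80/20 train/test split on iris, LDA on the train set
set.seed(7)
n <- nrow(iris)
ind <- sample(n, size = round(n * 0.8))
train_i <- iris[ind, ]
test_i  <- iris[-ind, ]
fit  <- lda(Species ~ ., data = train_i)
pred <- predict(fit, newdata = test_i)$class
tab  <- table(correct = test_i$Species, predicted = pred)
sum(diag(tab)) / sum(tab)  # share of correctly classified observations
```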
Let us start by calculating the Euclidean and Manhattan distances between observations.
#Reloading MASS and Boston data sets
library(MASS)
data('Boston')
#Scaling and standardizing the data set 'Boston'
boston_scaled1 <- scale(Boston)
#Changing the object to data frame
boston_scaled2 <- as.data.frame(boston_scaled1)
#Euclidean distance matrix
dist_eu <- dist(boston_scaled2)
#Let us see the summary of the distances
summary(dist_eu)
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.1343 3.4625 4.8241 4.9111 6.1863 14.3970
#Manhattan distance matrix
dist_man <- dist(boston_scaled2, method = "manhattan")
#Let us now see the summary of the distances again
summary(dist_man)
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.2662 8.4832 12.6090 13.5488 17.7568 48.8618
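The difference between the two metrics is easy to see on a toy pair of points: the Euclidean distance is the straight-line distance, while the Manhattan distance sums the absolute coordinate differences.

```r
# Two points, (0,0) and (3,4)
pts <- rbind(c(0, 0), c(3, 4))
as.numeric(dist(pts))                         # sqrt(3^2 + 4^2) = 5
as.numeric(dist(pts, method = "manhattan"))   # |3 - 0| + |4 - 0| = 7
```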
Now we continue to the K-means algorithm. K-means is one of the most used clustering methods: it assigns observations into groups (clusters) based on the similarity of the objects, and the kmeans() function computes the needed distances automatically. As there are quite a lot of variables, only five columns (columns 6 to 10) are plotted pairwise to keep the examination clear; the first five columns are less informative, and thereby columns 6 to 10 have been chosen.
#K-means clustering
km <-kmeans(boston_scaled2, centers = 3)
#Plotting the Boston dataset with clusters
#5 columns (columns 6 to 10) have been paired up to make examination more clear
pairs(boston_scaled2[6:10], col = km$cluster)
As K-means takes the number of clusters as an argument, we need to look for the optimal number of clusters. One way of estimating it is tested below: we look at how the total within-cluster sum of squares (WCSS) behaves as the number of clusters changes.
K-means may produce different results on every run, as it randomly assigns the initial cluster centers. The function set.seed() is used to make the result reproducible.
library(ggplot2)
#To ensure that the result does not vary every time
set.seed(123)
#Determining the number of clusters
k_max <- 10
#Calculating the total within sum of squares (twcss)
twcss <- sapply(1:k_max, function(k){kmeans(boston_scaled2, k)$tot.withinss})
#Visualizing the results
qplot(x = 1:k_max, y = twcss, geom = 'line')
The optimal number of clusters is where the total WCSS drops radically. In the figure above the most radical drop ends at the value 2, indicating that the optimal number of clusters is 2. Let us run the K-means algorithm again using this optimal number of clusters (2).
#K-means clustering
km <-kmeans(boston_scaled2, centers = 2)
#Plotting the Boston dataset with clusters
pairs(boston_scaled2[6:10], col = km$cluster)
The different variable pairs produce differently shaped clusterings. The variables ‘rad’ and ‘tax’ produce two clearly separated groups (one in the range of about -1 to 0 and the other around 1.5).
‘age’ and ‘dis’ produce more connected clusters, where the first cluster gradually changes into the second. In ‘dis’ the first cluster starts from -1 and the second begins as the value increases; in ‘age’ the first cluster starts from 1 and the second begins as the value decreases.
Summary of my learning
This week I felt slightly lost. I managed to do everything I was asked to do, but the deeper understanding is not fully there yet. I can make graphs and tables and I can tell what I see… But, for example, what do the arrows (of the variables) really tell me in the LDA graph? Can the distances on their own be interpreted or used somehow? How do we utilize the two groups resulting from the clustering? Maybe I’ll learn the answers later on. For now, I think I just need to settle for executing the tasks.
Regards, Janina
In this analysis we will be using a data set called “human”, which is a combination of the following two data sets:
The meta file for these data sets can be found here
Let us take a closer look at the data.
date()
## [1] "Thu Dec 09 16:07:30 2021"
human = read.table("http://s3.amazonaws.com/assets.datacamp.com/production/course_2218/datasets/human2.txt", sep=",", header=TRUE)
#Using my own table did not work for some reason
#human = read.table(file="human5.txt", sep=",", header=TRUE)
#Let us see the dimensions and structure of the data set
dim(human)
## [1] 155 8
str(human)
## 'data.frame': 155 obs. of 8 variables:
## $ Edu2.FM : num 1.007 0.997 0.983 0.989 0.969 ...
## $ Labo.FM : num 0.891 0.819 0.825 0.884 0.829 ...
## $ Edu.Exp : num 17.5 20.2 15.8 18.7 17.9 16.5 18.6 16.5 15.9 19.2 ...
## $ Life.Exp : num 81.6 82.4 83 80.2 81.6 80.9 80.9 79.1 82 81.8 ...
## $ GNI : int 64992 42261 56431 44025 45435 43919 39568 52947 42155 32689 ...
## $ Mat.Mor : int 4 6 6 5 6 7 9 28 11 8 ...
## $ Ado.Birth: num 7.8 12.1 1.9 5.1 6.2 3.8 8.2 31 14.5 25.3 ...
## $ Parli.F : num 39.6 30.5 28.5 38 36.9 36.9 19.9 19.4 28.2 31.4 ...
There are 155 observations of 8 variables:
Let us next show a graphical overview of the data:
library(GGally)
library(dplyr)
library(corrplot)
#Visualizing the 'human' variables
ggpairs(human)
Above we have used a generalized pairs plot from the GGally package to visualize the data frame.
From the distributions of the variables we can see that the expected years of life and education are both relatively high. The ratio of labour force participation of females to males in each country also peaks at relatively high values. In turn, GNI, maternal mortality and the adolescent birth rate are concentrated at low values. The percentage of female representatives in parliament is also quite low.
#Computing the correlation matrix and visualizing it with corrplot
cor(human) %>% corrplot
To study the linear connections, the cor() function has been used and its output visualized with the corrplot function. From the corrplot we can see that:
Next we will move on to the PCA.
Principal component analysis (PCA) is used in exploratory data analysis and for making predictive models. It is a statistical procedure that decomposes a data matrix into a product of smaller matrices, revealing the most important components of the variation.
In PCA, the data is first transformed to a new space with an equal or smaller number of dimensions. These new features are called the principal components, which always have the following properties:
Let us perform the PCA with Singular Value Decomposition (SVD) for our (not standardized) data.
#Performing PCA with the SVD method
pca_human <- prcomp(human)
#Creating a summary of pca_human
s <- summary(pca_human)
s
## Importance of components:
## PC1 PC2 PC3 PC4 PC5 PC6 PC7 PC8
## Standard deviation 1.854e+04 185.5219 25.19 11.45 3.766 1.566 0.1912 0.1591
## Proportion of Variance 9.999e-01 0.0001 0.00 0.00 0.000 0.000 0.0000 0.0000
## Cumulative Proportion 9.999e-01 1.0000 1.00 1.00 1.000 1.000 1.0000 1.0000
#Rounded percentages of variance captured by each PC
pca_pr <- round(100*s$importance[2, ], digits = 1)
#The percentages of variance
pca_pr
## PC1 PC2 PC3 PC4 PC5 PC6 PC7 PC8
## 100 0 0 0 0 0 0 0
#Creating object pc_lab (by the first two principal components) to be used as axis labels
pc_lab <- paste0(names(pca_pr), " (", pca_pr, "%)")
#Drawing a biplot
biplot(pca_human, cex = c(0.8, 1), col = c("grey40", "deeppink2"), xlab = pc_lab[1], ylab = pc_lab[2])
## Warning in arrows(0, 0, y[, 1L] * 0.8, y[, 2L] * 0.8, col = col[2L], length =
## arrow.len): zero-length arrow is of indeterminate angle and so skipped
## Warning in arrows(0, 0, y[, 1L] * 0.8, y[, 2L] * 0.8, col = col[2L], length =
## arrow.len): zero-length arrow is of indeterminate angle and so skipped
## Warning in arrows(0, 0, y[, 1L] * 0.8, y[, 2L] * 0.8, col = col[2L], length =
## arrow.len): zero-length arrow is of indeterminate angle and so skipped
## Warning in arrows(0, 0, y[, 1L] * 0.8, y[, 2L] * 0.8, col = col[2L], length =
## arrow.len): zero-length arrow is of indeterminate angle and so skipped
And now we repeat the Principal Component Analysis and its visualization for standardized ‘human’ data.
#Standardizing the variables
human_s <- scale(human)
#Performing PCA with the SVD method
pca_human_s <- prcomp(human_s)
#Creating a summary of pca_human_s
s_s <- summary(pca_human_s)
s_s
#Rounded percentages of variance captured by each PC
pca_pr_s <- round(100*s_s$importance[2, ], digits = 1)
#The percentages of variance
pca_pr_s
## PC1 PC2 PC3 PC4 PC5 PC6 PC7 PC8
## 53.6 16.2 9.6 7.6 5.5 3.6 2.6 1.3
#Creating object pc_lab (by the first two principal components) to be used as axis labels
pc_lab_s <- paste0(names(pca_pr_s), " (", pca_pr_s, "%)")
#Drawing a biplot
biplot(pca_human_s, cex = c(0.8, 1), col = c("grey40", "deeppink2"), xlab = pc_lab_s[1], ylab = pc_lab_s[2])
The importance tables of the two analyses are in fact very different. Comparing the variance captured by each PC, notable differences occur: when the data has not been standardized, essentially all the variance is captured by the first PC (PC1). This makes the scaling of the biplot odd and the graph hard to read and interpret. The Gross National Income (GNI) per capita overpowers all the other features and alone contributes to the dimension of PC1 (which makes sense, as its raw variance is orders of magnitude larger than that of the other variables).
When the data has been standardized the variance is captured by all 8 PC’s:
| PC1 | PC2 | PC3 | PC4 | PC5 | PC6 | PC7 | PC8 |
|---|---|---|---|---|---|---|---|
| 53.6% | 16.2% | 9.6% | 7.6% | 5.5% | 3.6% | 2.6% | 1.3% |
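The effect of standardization on PCA can be reproduced with a toy data set in which one variable (standing in for GNI) has a vastly larger variance than the other:

```r
# One variable with much larger variance dominates the unscaled PCA entirely
set.seed(1)
df <- data.frame(a = rnorm(100), b = 1000 * rnorm(100))
prop_raw <- summary(prcomp(df))$importance["Proportion of Variance", ]
prop_std <- summary(prcomp(df, scale. = TRUE))$importance["Proportion of Variance", ]
round(prop_raw, 3)  # PC1 captures essentially all of the variance
round(prop_std, 3)  # the variance is shared between the components
```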
The points presented at the beginning of the PCA chapter apply: most of the variance is captured by the first PC (PC1, 53.6 %) and every PC is less important than the previous one.
From the standardized graph it is easy to interpret the relation between variables:
notes
1. A scatter plot is drawn where the observations are placed on x and y coordinates defined by the first two principal components, PC1 (x) and PC2 (y)
2. Arrows are drawn to visualize the connections between the original features and the PC’s
The angle between two arrows reflects the correlation between the features (a small angle means a high positive correlation)
The angle between a feature arrow and a PC axis reflects the correlation between the two (again, a small angle means a high positive correlation)
The length of an arrow is proportional to the standard deviation of the feature
In the standardized graph:
There is positive correlation between the following features:
1st batch (pointing left): Expected Years of Education, Gross National Income (GNI) per Capita, Ratio of Female and Male populations with secondary education in each country and Life Expectancy at Birth
2nd batch (pointing up): Percent Representation in Parliament and Ratio of labour force participation of females and males in each country
3rd batch (pointing right): Maternal Mortality Ratio and Adolescent Birth Rate
Arrows that are orthogonal indicate no correlation between the features.
Arrows pointing in opposite directions indicate negative correlation between the features.
About the contribution of the principal components:
The 1st and 3rd batches of arrows are aligned with the x-axis (PC1), so these features contribute to that dimension.
The 2nd batch of arrows is aligned with the y-axis (PC2), so these features contribute to that dimension.
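The angle interpretation can be checked numerically: cor() on the standardized data gives the pairwise correlations that the biplot arrows approximate. A toy sketch with two correlated features and one independent one:

```r
set.seed(2)
a <- rnorm(100)
b <- a + rnorm(100, sd = 0.3)  # strongly correlated with a -> small angle
c <- rnorm(100)                # independent -> roughly orthogonal arrow
round(cor(data.frame(a, b, c)), 2)
```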
For Multiple Correspondence Analysis, let us first load the ‘tea’ data set from the package ‘FactoMineR’. The structure of the data is explored and visualized below.
library(FactoMineR)
## Warning: package 'FactoMineR' was built under R version 4.1.2
library(ggplot2)
library(tidyr)
#Loading the data
data("tea")
#Exploring the data set
str(tea) #300 obs. of 36 variables:
## 'data.frame': 300 obs. of 36 variables:
## $ breakfast : Factor w/ 2 levels "breakfast","Not.breakfast": 1 1 2 2 1 2 1 2 1 1 ...
## $ tea.time : Factor w/ 2 levels "Not.tea time",..: 1 1 2 1 1 1 2 2 2 1 ...
## $ evening : Factor w/ 2 levels "evening","Not.evening": 2 2 1 2 1 2 2 1 2 1 ...
## $ lunch : Factor w/ 2 levels "lunch","Not.lunch": 2 2 2 2 2 2 2 2 2 2 ...
## $ dinner : Factor w/ 2 levels "dinner","Not.dinner": 2 2 1 1 2 1 2 2 2 2 ...
## $ always : Factor w/ 2 levels "always","Not.always": 2 2 2 2 1 2 2 2 2 2 ...
## $ home : Factor w/ 2 levels "home","Not.home": 1 1 1 1 1 1 1 1 1 1 ...
## $ work : Factor w/ 2 levels "Not.work","work": 1 1 2 1 1 1 1 1 1 1 ...
## $ tearoom : Factor w/ 2 levels "Not.tearoom",..: 1 1 1 1 1 1 1 1 1 2 ...
## $ friends : Factor w/ 2 levels "friends","Not.friends": 2 2 1 2 2 2 1 2 2 2 ...
## $ resto : Factor w/ 2 levels "Not.resto","resto": 1 1 2 1 1 1 1 1 1 1 ...
## $ pub : Factor w/ 2 levels "Not.pub","pub": 1 1 1 1 1 1 1 1 1 1 ...
## $ Tea : Factor w/ 3 levels "black","Earl Grey",..: 1 1 2 2 2 2 2 1 2 1 ...
## $ How : Factor w/ 4 levels "alone","lemon",..: 1 3 1 1 1 1 1 3 3 1 ...
## $ sugar : Factor w/ 2 levels "No.sugar","sugar": 2 1 1 2 1 1 1 1 1 1 ...
## $ how : Factor w/ 3 levels "tea bag","tea bag+unpackaged",..: 1 1 1 1 1 1 1 1 2 2 ...
## $ where : Factor w/ 3 levels "chain store",..: 1 1 1 1 1 1 1 1 2 2 ...
## $ price : Factor w/ 6 levels "p_branded","p_cheap",..: 4 6 6 6 6 3 6 6 5 5 ...
## $ age : int 39 45 47 23 48 21 37 36 40 37 ...
## $ sex : Factor w/ 2 levels "F","M": 2 1 1 2 2 2 2 1 2 2 ...
## $ SPC : Factor w/ 7 levels "employee","middle",..: 2 2 4 6 1 6 5 2 5 5 ...
## $ Sport : Factor w/ 2 levels "Not.sportsman",..: 2 2 2 1 2 2 2 2 2 1 ...
## $ age_Q : Factor w/ 5 levels "15-24","25-34",..: 3 4 4 1 4 1 3 3 3 3 ...
## $ frequency : Factor w/ 4 levels "1/day","1 to 2/week",..: 1 1 3 1 3 1 4 2 3 3 ...
## $ escape.exoticism: Factor w/ 2 levels "escape-exoticism",..: 2 1 2 1 1 2 2 2 2 2 ...
## $ spirituality : Factor w/ 2 levels "Not.spirituality",..: 1 1 1 2 2 1 1 1 1 1 ...
## $ healthy : Factor w/ 2 levels "healthy","Not.healthy": 1 1 1 1 2 1 1 1 2 1 ...
## $ diuretic : Factor w/ 2 levels "diuretic","Not.diuretic": 2 1 1 2 1 2 2 2 2 1 ...
## $ friendliness : Factor w/ 2 levels "friendliness",..: 2 2 1 2 1 2 2 1 2 1 ...
## $ iron.absorption : Factor w/ 2 levels "iron absorption",..: 2 2 2 2 2 2 2 2 2 2 ...
## $ feminine : Factor w/ 2 levels "feminine","Not.feminine": 2 2 2 2 2 2 2 1 2 2 ...
## $ sophisticated : Factor w/ 2 levels "Not.sophisticated",..: 1 1 1 2 1 1 1 2 2 1 ...
## $ slimming : Factor w/ 2 levels "No.slimming",..: 1 1 1 1 1 1 1 1 1 1 ...
## $ exciting : Factor w/ 2 levels "exciting","No.exciting": 2 1 2 2 2 2 2 2 2 2 ...
## $ relaxing : Factor w/ 2 levels "No.relaxing",..: 1 1 2 2 2 2 2 2 2 2 ...
## $ effect.on.health: Factor w/ 2 levels "effect on health",..: 2 2 2 2 2 2 2 2 2 2 ...
#36 variables is a lot, let us choose a few to examine more closely
#Column names to keep in the data set
keep_columns <- c("Tea", "How", "how", "sugar", "where", "lunch")
#Selecting the 'keep_columns' to create a new data set
tea_time <- dplyr::select(tea, one_of(keep_columns))
#The summaries and structure of the data
summary(tea_time)
## Tea How how sugar
## black : 74 alone:195 tea bag :170 No.sugar:155
## Earl Grey:193 lemon: 33 tea bag+unpackaged: 94 sugar :145
## green : 33 milk : 63 unpackaged : 36
## other: 9
## where lunch
## chain store :192 lunch : 44
## chain store+tea shop: 78 Not.lunch:256
## tea shop : 30
##
str(tea_time)
## 'data.frame': 300 obs. of 6 variables:
## $ Tea : Factor w/ 3 levels "black","Earl Grey",..: 1 1 2 2 2 2 2 1 2 1 ...
## $ How : Factor w/ 4 levels "alone","lemon",..: 1 3 1 1 1 1 1 3 3 1 ...
## $ how : Factor w/ 3 levels "tea bag","tea bag+unpackaged",..: 1 1 1 1 1 1 1 1 2 2 ...
## $ sugar: Factor w/ 2 levels "No.sugar","sugar": 2 1 1 2 1 1 1 1 1 1 ...
## $ where: Factor w/ 3 levels "chain store",..: 1 1 1 1 1 1 1 1 2 2 ...
## $ lunch: Factor w/ 2 levels "lunch","Not.lunch": 2 2 2 2 2 2 2 2 2 2 ...
#Visualization of the data set
gather(tea_time) %>% ggplot(aes(value)) + facet_wrap("key", scales = "free") + geom_bar() + theme(axis.text.x = element_text(angle = 45, hjust = 1, size = 8))
## Warning: attributes are not identical across measure variables;
## they will be dropped
There are 300 observations of 36 variables. As 36 variables is quite a lot, only six of them have been taken into closer examination above: how, How, lunch, sugar, Tea and where. It seems that it is most common to drink tea alone, using ready-made tea bags (How and how). Some people drink tea during lunch time, but most of the respondents drink their tea at some other point of the day (lunch). Drinking tea with sugar is nearly as common as drinking it without (sugar). Earl Grey is clearly the most popular tea type, compared to black and green tea (Tea). The tea is most likely bought from a chain store, but some people do buy their tea from tea shops (where).
Thus, to pull all the results together: it seems most common to drink Earl Grey, bought from a chain store and brewed from ready-made tea bags, alone, without sugar and at some other time of the day than lunch.
Let us do the Multiple Correspondence Analysis (MCA) next. MCA is a dimensionality reduction method used to analyse the pattern of relationships between several categorical variables, and it can be seen as a generalization of PCA to categorical data. The general aim is to condense and present the information of the cross-tabulations of the categorical variables in a clear graphical form.
#Multiple correspondence analysis
mca <- MCA(tea_time, graph = FALSE)
#Summary of the model
summary(mca)
##
## Call:
## MCA(X = tea_time, graph = FALSE)
##
##
## Eigenvalues
## Dim.1 Dim.2 Dim.3 Dim.4 Dim.5 Dim.6 Dim.7
## Variance 0.279 0.261 0.219 0.189 0.177 0.156 0.144
## % of var. 15.238 14.232 11.964 10.333 9.667 8.519 7.841
## Cumulative % of var. 15.238 29.471 41.435 51.768 61.434 69.953 77.794
## Dim.8 Dim.9 Dim.10 Dim.11
## Variance 0.141 0.117 0.087 0.062
## % of var. 7.705 6.392 4.724 3.385
## Cumulative % of var. 85.500 91.891 96.615 100.000
##
## Individuals (the 10 first)
## Dim.1 ctr cos2 Dim.2 ctr cos2 Dim.3
## 1 | -0.298 0.106 0.086 | -0.328 0.137 0.105 | -0.327
## 2 | -0.237 0.067 0.036 | -0.136 0.024 0.012 | -0.695
## 3 | -0.369 0.162 0.231 | -0.300 0.115 0.153 | -0.202
## 4 | -0.530 0.335 0.460 | -0.318 0.129 0.166 | 0.211
## 5 | -0.369 0.162 0.231 | -0.300 0.115 0.153 | -0.202
## 6 | -0.369 0.162 0.231 | -0.300 0.115 0.153 | -0.202
## 7 | -0.369 0.162 0.231 | -0.300 0.115 0.153 | -0.202
## 8 | -0.237 0.067 0.036 | -0.136 0.024 0.012 | -0.695
## 9 | 0.143 0.024 0.012 | 0.871 0.969 0.435 | -0.067
## 10 | 0.476 0.271 0.140 | 0.687 0.604 0.291 | -0.650
## ctr cos2
## 1 0.163 0.104 |
## 2 0.735 0.314 |
## 3 0.062 0.069 |
## 4 0.068 0.073 |
## 5 0.062 0.069 |
## 6 0.062 0.069 |
## 7 0.062 0.069 |
## 8 0.735 0.314 |
## 9 0.007 0.003 |
## 10 0.643 0.261 |
##
## Categories (the 10 first)
## Dim.1 ctr cos2 v.test Dim.2 ctr cos2
## black | 0.473 3.288 0.073 4.677 | 0.094 0.139 0.003
## Earl Grey | -0.264 2.680 0.126 -6.137 | 0.123 0.626 0.027
## green | 0.486 1.547 0.029 2.952 | -0.933 6.111 0.107
## alone | -0.018 0.012 0.001 -0.418 | -0.262 2.841 0.127
## lemon | 0.669 2.938 0.055 4.068 | 0.531 1.979 0.035
## milk | -0.337 1.420 0.030 -3.002 | 0.272 0.990 0.020
## other | 0.288 0.148 0.003 0.876 | 1.820 6.347 0.102
## tea bag | -0.608 12.499 0.483 -12.023 | -0.351 4.459 0.161
## tea bag+unpackaged | 0.350 2.289 0.056 4.088 | 1.024 20.968 0.478
## unpackaged | 1.958 27.432 0.523 12.499 | -1.015 7.898 0.141
## v.test Dim.3 ctr cos2 v.test
## black 0.929 | -1.081 21.888 0.382 -10.692 |
## Earl Grey 2.867 | 0.433 9.160 0.338 10.053 |
## green -5.669 | -0.108 0.098 0.001 -0.659 |
## alone -6.164 | -0.113 0.627 0.024 -2.655 |
## lemon 3.226 | 1.329 14.771 0.218 8.081 |
## milk 2.422 | 0.013 0.003 0.000 0.116 |
## other 5.534 | -2.524 14.526 0.197 -7.676 |
## tea bag -6.941 | -0.065 0.183 0.006 -1.287 |
## tea bag+unpackaged 11.956 | 0.019 0.009 0.000 0.226 |
## unpackaged -6.482 | 0.257 0.602 0.009 1.640 |
##
## Categorical variables (eta2)
## Dim.1 Dim.2 Dim.3
## Tea | 0.126 0.108 0.410 |
## How | 0.076 0.190 0.394 |
## how | 0.708 0.522 0.010 |
## sugar | 0.065 0.001 0.336 |
## where | 0.702 0.681 0.055 |
## lunch | 0.000 0.064 0.111 |
In the summary of the MCA we first have the Eigenvalues: the variances and the percentages of variance retained by each of the 11 dimensions.
Then we have Individuals: for the first 10 individuals, the coordinates, the contributions (%) to the dimensions and the cos2 values (squared correlations) on the dimensions.
The summary also contains Categories: for the first 10 categories, the coordinates, the contributions (%), the cos2 values (squared correlations) and the v.test values. Note that the v.test follows a normal distribution, so when the value is below -1.96 or above +1.96 the coordinate differs significantly from zero. This applies to most of the categories; the exceptions with |v.test| < 1.96 are ‘alone’ and ‘other’ (Dim.1), ‘black’ (Dim.2) and, in the second table, ‘green’, ‘milk’, ‘tea bag’, ‘tea bag+unpackaged’ and ‘unpackaged’ (Dim.3).
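As a reminder of where the ± 1.96 cut-off comes from: it is the two-sided 5% critical value of the standard normal distribution that the v.test is compared against:

```r
qnorm(0.975)      # critical value, about 1.96
2 * pnorm(-1.96)  # two-sided p-value at |v.test| = 1.96, about 0.05
```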
Lastly we have Categorical variables: the squared correlations (eta2) between each variable and the dimensions. Values close to 1 indicate a strong link between the variable and the dimension; see, for example, how and Dim.1 (0.708) or where and Dim.1 (0.702).
#Visualization of MCA
plot(mca, invisible=c("ind"), habillage = "quali", graph.type = "classic")
Here we have the MCA factor map as a biplot. The variable categories have been drawn on the first two dimensions (Dim 1 explaining ~15% of the variance and Dim 2 ~14%). The distance between variable categories gives a measure of their similarity: the closer they are, the more similar.
Summary of my learning
This week I got answers to some of the questions I had last week. The explanations of how to interpret biplots were especially important and interesting. I think everything went well this week and therefore I am feeling quite excited!
Regards, Janina
This week we have two datasets to investigate (BPRS and RATS). These datasets are longitudinal, and thereby the response variable (and maybe even some or all of the explanatory variables) is observed several times > the data are multivariate. The analysis of repeated measures and longitudinal data requires models that can both assess the effects of explanatory variables on the repeated measures of the response variable and account for the likely correlations between these measures.
The first part of this week’s data analysis will focus on the RATS dataset. We will look at some graphical displays of the data and a possible approach to its analysis. Let us take a closer look at the data.
The data is from a nutrition study conducted on three groups of rats (Crowder and Hand, 1990). The three groups of rats had different diets. Over a 9-week period, each animal’s body weight (in grams) was recorded repeatedly, approximately once a week > repeated measures data. The interest was in whether there were any differences in the growth profiles of the three groups.
date()
## [1] "Thu Dec 09 16:07:39 2021"
RATSL = read.table(file="RATSL.txt", sep=",", header=TRUE)
#Factor ID & Group of RATS
RATSL$ID <- factor(RATSL$ID)
RATSL$Group <- factor(RATSL$Group)
#Let us see the dimensions and structure of the dataset
dim(RATSL)
## [1] 176 5
str(RATSL)
## 'data.frame': 176 obs. of 5 variables:
## $ ID : Factor w/ 16 levels "1","2","3","4",..: 1 2 3 4 5 6 7 8 9 10 ...
## $ Group : Factor w/ 3 levels "1","2","3": 1 1 1 1 1 1 1 1 2 2 ...
## $ times : chr "WD1" "WD1" "WD1" "WD1" ...
## $ Weight: int 240 225 245 260 255 260 275 245 410 405 ...
## $ Time : int 1 1 1 1 1 1 1 1 1 1 ...
There are 176 observations of 5 variables. First there is the ID number of the rat and the group the animal belongs to. The variable ‘Time’ has been extracted from ‘times’ and gives the day of the measurement. The variable ‘Weight’ is the weight of the rat in grams.
We shall begin by plotting the weight of the rats over time in the three groups (1, 2 and 3).
library(GGally)
library(dplyr)
library(corrplot)
library(ggplot2)
library(reshape2)
## Warning: package 'reshape2' was built under R version 4.1.2
##
## Attaching package: 'reshape2'
## The following object is masked from 'package:tidyr':
##
## smiths
#Drawing a plot using ggplot
ggplot(RATSL, aes(x = Time, y = Weight, col=Group)) +
geom_point() +
labs(x="Time (d)", y="Weight (g)", size="horsepower",
col="Group")
Above, the weight profiles of all three groups are presented in one graph. The colouring shows which group each rat belongs to (1 = orange, 2 = green, 3 = blue). It is clearly visible that the rats in group 1 are notably lighter than those in the two other groups.
Let us next draw the weight profiles following MABS4IODS, chapter 8:
#Drawing a new plot using ggplot
ggplot(RATSL, aes(x = Time, y = Weight, linetype = ID, color = ID)) +
geom_point(size = 0.5) +
geom_line(aes(group = ID)) +
scale_linetype_manual(values = rep(1:10, times=4)) +
facet_grid(. ~ Group, labeller = label_both) +
scale_y_continuous(limits = c(min(RATSL$Weight), max(RATSL$Weight)))
In the graph above we can first confirm that the rats in the first group really do seem lighter than those in the other groups. In the second group most of the rats weigh between 400 g and 500 g, except for one rat that is approximately 100 g heavier. Overall the rats in the third group are the heaviest, if the one exception from group 2 is excluded.
In groups 2 and 3 all the rats are gaining weight. The change in weight is nearly non-existent for the rats of group 1. In every group there is one rat whose weight differs from the others > an outlier.
The changes in weight can be seen more clearly when the values are standardized. In the following graph the weights have been standardized by subtracting the relevant occasion mean from each original observation and then dividing by the corresponding occasion standard deviation.
#Standardising the variable Weight
RATSL <- RATSL %>%
group_by(Time) %>%
mutate(stdweight = (Weight - mean(Weight))/sd(Weight) ) %>%
ungroup()
#Plotting again with the standardized weight
ggplot(RATSL, aes(x = Time, y = stdweight, linetype = ID, color = ID)) +
geom_point(size = 0.5) +
geom_line() +
scale_linetype_manual(values = rep(1:10, times=4)) +
facet_grid(. ~ Group, labeller = label_both) +
scale_y_continuous(name = "Standardized weight")
We can still move on from the standardized weight profiles. Individual profiles are usually not in the main focus and their interpretation is not that meaningful at this point. Instead we will produce a graph showing average profiles for each group at each time point.
Thus, let us do that:
#Number of weighting times
n <- RATSL$Time %>% unique() %>% length()
#Summary data with mean and standard error of Weight by Group and Time
RATSS <- RATSL %>%
group_by(Group, Time) %>%
summarise( mean = mean(Weight), se = sd(Weight)/sqrt(n) ) %>%
ungroup()
## `summarise()` has grouped output by 'Group'. You can override using the `.groups` argument.
#Glimpse the data
glimpse(RATSS)
## Rows: 33
## Columns: 4
## $ Group <fct> 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2~
## $ Time <int> 1, 8, 15, 22, 29, 36, 43, 44, 50, 57, 64, 1, 8, 15, 22, 29, 36, ~
## $ mean <dbl> 250.625, 255.000, 254.375, 261.875, 264.625, 265.000, 267.375, 2~
## $ se <dbl> 4.589478, 3.947710, 3.460116, 4.100800, 3.333956, 3.552939, 3.30~
#Plotting the mean profiles
ggplot(RATSS, aes(x = Time, y = mean, linetype = Group, shape = Group)) +
geom_line() +
scale_linetype_manual(values = c(1,2,3)) +
geom_point(size=3) +
scale_shape_manual(values = c(1,2,3)) +
geom_errorbar(aes(ymin = mean - se, ymax = mean + se, linetype="1"), width=0.3) +
theme(legend.position = c(0.95,0.6)) +
scale_y_continuous(name = "mean(Weight) +/- se(Weight)")
Above we have all three group averages in the same graph (see the symbol legend on the right).
There is no overlap between the groups. Group 1 has a notably lighter weight profile, which increases only slightly over time. The weight profile of group 3 is a little heavier than that of group 2. Overall, the weight profiles of groups 2 and 3 show a similar increase over time. The standard error is greatest in group 2 (the longest error bars), meaning there is the most variance and uncertainty in those results. The error in group 1 seems to be the smallest.
#Creating a summary data by Group and ID
RATSSbp <- RATSL %>%
group_by(Group, ID) %>%
summarise(mean=mean(Weight)) %>%
ungroup()
## `summarise()` has grouped output by 'Group'. You can override using the `.groups` argument.
#Drawing a boxplot of the mean versus group
ggplot(RATSSbp, aes(x = Group, y = mean, col=Group)) +
geom_boxplot() +
stat_summary(fun = "mean", geom = "point", shape=23, size=4, fill = "white") +
scale_y_continuous(name = "Mean weight")
Above we have a boxplot of the mean summary measures for the three groups. The outlier in each group (discussed earlier) can now be seen as a dot (the maximum or minimum observation of the group): below the boxes for groups 1 and 3, and above the box for group 2.
The greatest outlier is in group 2, and it has quite a large impact on that group’s variability. Let us next remove the outlier by filtering out mean weights exceeding 550 grams.
#Filtering mean weight values exceeding 550 g
RATSSbpf <- RATSSbp %>%
filter(mean < 550)
glimpse(RATSSbpf)
## Rows: 15
## Columns: 3
## $ Group <fct> 1, 1, 1, 1, 1, 1, 1, 1, 2, 2, 2, 3, 3, 3, 3
## $ ID <fct> 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 13, 14, 15, 16
## $ mean <dbl> 261.0909, 237.6364, 260.1818, 266.5455, 269.4545, 274.7273, 274.~
#Drawing a boxplot of the mean versus group using the filtered data
ggplot(RATSSbpf, aes(x = Group, y = mean, col=Group)) +
geom_boxplot() +
stat_summary(fun = "mean", geom = "point", shape=23, size=4, fill = "white") +
scale_y_continuous(name = "Mean weight")
Note how the boxplots of the three groups are now fairly similar in size. Removing the outlier makes it easier to see how each diet works for the majority.
Let us next perform a formal test for a difference, i.e. apply pairwise t-tests to assess the differences between the three groups and to calculate confidence intervals for these differences. We shall use the averaged data without the outlier.
library(rstatix)
## Warning: package 'rstatix' was built under R version 4.1.2
##
## Attaching package: 'rstatix'
## The following object is masked from 'package:MASS':
##
## select
## The following object is masked from 'package:stats':
##
## filter
library(ggpubr)
## Warning: package 'ggpubr' was built under R version 4.1.2
##
## Attaching package: 'ggpubr'
## The following object is masked from 'package:cowplot':
##
## get_legend
#Summary statistics of the data
RATSSbpf %>%
group_by(Group) %>%
get_summary_stats(mean, type = "mean_sd")
## # A tibble: 3 x 5
## Group variable n mean sd
## <fct> <chr> <dbl> <dbl> <dbl>
## 1 1 mean 8 264. 11.9
## 2 2 mean 3 449. 7.55
## 3 3 mean 4 526. 22.1
#Completing a pairwise t-tests
pwt <- RATSSbpf %>%
pairwise_t_test(mean ~ Group, p.adjust.method = "bonferroni")
pwt
## # A tibble: 3 x 9
## .y. group1 group2 n1 n2 p p.signif p.adj p.adj.signif
## * <chr> <chr> <chr> <int> <int> <dbl> <chr> <dbl> <chr>
## 1 mean 1 2 8 3 2.9 e-10 **** 8.71e-10 ****
## 2 mean 1 3 8 4 1.56e-12 **** 4.68e-12 ****
## 3 mean 2 3 3 4 1.79e- 5 **** 5.37e- 5 ****
The outcome is a 3x9 table. Each pairwise comparison of groups is presented as a row, and the columns show the significance. There is highly significant evidence that the groups differ from each other.
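As a sketch of what the Bonferroni adjustment above does: each raw p-value is multiplied by the number of comparisons (three pairs here) and capped at 1. Using the raw p-values from the table:

```r
p_raw <- c(2.9e-10, 1.56e-12, 1.79e-5)   # p column of the pairwise t-tests
p.adjust(p_raw, method = "bonferroni")   # matches the p.adj column
```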
Let us next fit a linear model with the mean as the response and the baseline (day 1) weight as a covariate. From that we can compute the analysis of variance table for the fitted model with anova().
RATS <- read.table("https://raw.githubusercontent.com/KimmoVehkalahti/MABS/master/Examples/data/rats.txt", sep ="\t", header = T)
#Adding the baseline
RATSL_B <- RATSSbp %>%
mutate(baseline = RATS$WD1)
#Fitting the linear model with the mean as the response
fit <- lm(mean ~ baseline + Group, data = RATSL_B)
#Computing the analysis of variance table for the fitted model with anova()
anova(fit)
## Analysis of Variance Table
##
## Response: mean
## Df Sum Sq Mean Sq F value Pr(>F)
## baseline 1 252125 252125 2237.0655 5.217e-15 ***
## Group 2 726 363 3.2219 0.07586 .
## Residuals 12 1352 113
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
The Analysis of Variance Table shows that the baseline weight is strongly related to the weight after the diet period. However, once the baseline is accounted for, the evidence of diet differences between the groups is weak (Group: p = 0.076). The large group differences seen in the earlier plots and in the pairwise t-tests thus largely reflect differences that were already present at baseline.
Now we shall move on to the next dataset called ‘BPRS’. In this dataset 40 males were randomly assigned to one of two treatment groups and each person was rated on the Brief Psychiatric Rating Scale (BPRS) measured before treatment began (week 0) and then at weekly intervals for eight weeks. The BPRS assesses the level of 18 symptom constructs such as hostility, suspiciousness, hallucinations and grandiosity; each of these is rated from 1 (not present) to 7 (extremely severe). The scale is used to evaluate patients suspected of having schizophrenia.
BPRSL = read.table(file="BPRSL.txt", sep=",", header=TRUE)
#Factor treatment & subject of BPRSL
BPRSL$treatment <- factor(BPRSL$treatment)
BPRSL$subject <- factor(BPRSL$subject)
#Let us see the dimensions and structure of the dataset
dim(BPRSL)
## [1] 360 5
str(BPRSL)
## 'data.frame': 360 obs. of 5 variables:
## $ treatment: Factor w/ 2 levels "1","2": 1 1 1 1 1 1 1 1 1 1 ...
## $ subject : Factor w/ 20 levels "1","2","3","4",..: 1 2 3 4 5 6 7 8 9 10 ...
## $ weeks : chr "week0" "week0" "week0" "week0" ...
## $ bprs : int 42 58 54 55 72 48 71 30 41 57 ...
## $ week : int 0 0 0 0 0 0 0 0 0 0 ...
There are 360 observations of 5 variables. First there is the ‘treatment’ of the person, which indicates which one of the two treatments has been used on the ‘subject’ (the person’s ID-like number). The variable ‘week’ has been constructed from ‘weeks’ and tells which week of the study is in question. The variable ‘bprs’ gives the person’s brief psychiatric rating scale score.
Let us plot the BPRS against the time (weeks).
#Plotting the BPRS data
ggplot(BPRSL, aes(x = week, y = bprs, col = subject)) +
geom_point(size = 0.5) +
geom_line() + scale_x_continuous(name = "Week", breaks = seq(0, 8, 1)) + scale_y_continuous(name = "BPRS") + theme(legend.position = "top")
Putting all the observations in one figure is not very informative; it is fairly messy. Let us separate the subjects by the two treatments.
#Plotting the BPRS data once more
ggplot(BPRSL, aes(x = week, y = bprs, linetype = subject, col = subject)) +
geom_point(size = 0.5) +
geom_line() +
scale_linetype_manual(values = rep(1:10, times=4)) +
facet_grid(. ~ treatment, labeller = label_both) +
scale_y_continuous(limits = c(min(BPRSL$bprs), max(BPRSL$bprs)))
The graphs look a lot clearer now. We can already say that the BPRS score decreases for nearly all the men during the eight-week period. However, there are also individual differences and variability. The overall variability between the observations seems to decrease over time. There is not much crossing of profiles: the men whose BPRS is higher at the beginning tend to remain higher throughout the study.
Let us next summarize the data by plotting the mean BPRS profile, with standard errors, for each treatment group.
#Number of subject
n <- BPRSL$subject %>% unique() %>% length()
#Summary data with mean and standard error of bprs by treatment and week
BPRSS <- BPRSL %>%
group_by(treatment, week) %>%
summarise(mean = mean(bprs), se = sd(bprs)/sqrt(n)) %>%
ungroup()
## `summarise()` has grouped output by 'treatment'. You can override using the `.groups` argument.
#Plotting the mean profiles of subjects
ggplot(BPRSS, aes(x = week, y = mean, linetype = treatment, shape = treatment)) +
geom_line(aes(group = treatment)) +
scale_linetype_manual(values = c(1,2)) +
geom_point(size=4) +
scale_shape_manual(values = c(1,2)) +
geom_errorbar(aes(ymin = mean - se, ymax = mean + se, linetype="1"), width=0.2) +
theme(legend.position = c(0.95,0.6)) +
scale_y_continuous(name = "mean(BPRS) +/- se(BPRS)")
Above we have both treatment group averages in the same graph (see the symbol legend on the right).
There is considerable overlap in the mean profiles, which suggests that there is little difference between the two treatment groups with respect to the mean BPRS values. Both treatment groups show a decrease in BPRS during the eight-week period.
A linear mixed effects model is suitable for responses that are assumed to be approximately normally distributed after conditioning on the explanatory variables. Such a model formalizes the idea that an individual’s pattern of responses is likely to depend on many characteristics of that individual, including some that are unobserved. The unobserved variables are included in the model as random variables > random effects. The essential feature of the model is that the (usually positive) correlation among the repeated measurements on the same individual arises from these shared, unobserved variables (MABS4IODS, chapter 9).
First, however, let us fit a linear regression model ignoring the repeated-measures structure of the data. BPRS will act as the response variable, with week and treatment as explanatory variables.
#Creating a regression model BPRS_reg
BPRS_reg <- lm(bprs ~ week + treatment, data = BPRSL)
#Printing a summary of the model
summary(BPRS_reg)
##
## Call:
## lm(formula = bprs ~ week + treatment, data = BPRSL)
##
## Residuals:
## Min 1Q Median 3Q Max
## -22.454 -8.965 -3.196 7.002 50.244
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 46.4539 1.3670 33.982 <2e-16 ***
## week -2.2704 0.2524 -8.995 <2e-16 ***
## treatment2 0.5722 1.3034 0.439 0.661
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 12.37 on 357 degrees of freedom
## Multiple R-squared: 0.1851, Adjusted R-squared: 0.1806
## F-statistic: 40.55 on 2 and 357 DF, p-value: < 2.2e-16
In the results above, the intercept and ‘week’ are strongly significant, whereas treatment 2 shows no significant difference. However, the model cannot be trusted, as the different measurements of the same subject are likely correlated. This correlation will be taken into account next by adding a random effect.
Let us continue with a random intercept model. The random effect is assumed to be normally distributed and constant in time. The linear fit for each subject is thereby allowed to differ from the other subjects in its intercept.
library(lme4)
## Warning: package 'lme4' was built under R version 4.1.2
## Loading required package: Matrix
##
## Attaching package: 'Matrix'
## The following objects are masked from 'package:tidyr':
##
## expand, pack, unpack
#Creating a random intercept model
BPRS_ref <- lmer(bprs ~ week + treatment + (1 | subject), data = BPRSL, REML = FALSE)
#Printing the summary of the model
summary(BPRS_ref)
## Linear mixed model fit by maximum likelihood ['lmerMod']
## Formula: bprs ~ week + treatment + (1 | subject)
## Data: BPRSL
##
## AIC BIC logLik deviance df.resid
## 2748.7 2768.1 -1369.4 2738.7 355
##
## Scaled residuals:
## Min 1Q Median 3Q Max
## -3.0481 -0.6749 -0.1361 0.4813 3.4855
##
## Random effects:
## Groups Name Variance Std.Dev.
## subject (Intercept) 47.41 6.885
## Residual 104.21 10.208
## Number of obs: 360, groups: subject, 20
##
## Fixed effects:
## Estimate Std. Error t value
## (Intercept) 46.4539 1.9090 24.334
## week -2.2704 0.2084 -10.896
## treatment2 0.5722 1.0761 0.532
##
## Correlation of Fixed Effects:
## (Intr) week
## week -0.437
## treatment2 -0.282 0.000
From the t-values above we can again confirm that there is strong evidence of an effect of ‘week’ (and a clearly non-zero intercept), whereas treatment 2 remains non-significant. Comparing the standard error of ‘week’ with the previous model, we observe that it has decreased slightly. This is due to taking the within-subject correlation into account. In addition, the standard error of ‘treatment 2’ has decreased.
We may still improve our model. The random intercept model does not always represent the observed pattern of variance and correlation between the measurements in longitudinal data. We will therefore next fit a random intercept and random slope model. This allows the linear regression fit for each individual to differ not only in intercept but also in slope, making it possible to account for individual differences in the BPRS profiles as well as the effect of time. The two random effects are assumed to follow a bivariate normal distribution.
#Creating a random intercept and random slope model
BPRS_ref1 <- lmer(bprs ~ week + treatment + (week | subject), data = BPRSL, REML = FALSE)
#Printing a summary of the model
summary(BPRS_ref1)
## Linear mixed model fit by maximum likelihood ['lmerMod']
## Formula: bprs ~ week + treatment + (week | subject)
## Data: BPRSL
##
## AIC BIC logLik deviance df.resid
## 2745.4 2772.6 -1365.7 2731.4 353
##
## Scaled residuals:
## Min 1Q Median 3Q Max
## -2.8919 -0.6194 -0.0691 0.5531 3.7976
##
## Random effects:
## Groups Name Variance Std.Dev. Corr
## subject (Intercept) 64.8222 8.0512
## week 0.9609 0.9802 -0.51
## Residual 97.4305 9.8707
## Number of obs: 360, groups: subject, 20
##
## Fixed effects:
## Estimate Std. Error t value
## (Intercept) 46.4539 2.1052 22.066
## week -2.2704 0.2977 -7.626
## treatment2 0.5722 1.0405 0.550
##
## Correlation of Fixed Effects:
## (Intr) week
## week -0.582
## treatment2 -0.247 0.000
#Performing an ANOVA test on the two models
anova(BPRS_ref1, BPRS_ref)
## Data: BPRSL
## Models:
## BPRS_ref: bprs ~ week + treatment + (1 | subject)
## BPRS_ref1: bprs ~ week + treatment + (week | subject)
## npar AIC BIC logLik deviance Chisq Df Pr(>Chisq)
## BPRS_ref 5 2748.7 2768.1 -1369.4 2738.7
## BPRS_ref1 7 2745.4 2772.6 -1365.7 2731.4 7.2721 2 0.02636 *
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
From the ANOVA likelihood ratio test table we can see that the p-value of the chi-squared statistic is fairly small (0.02636 < 0.05, i.e. significant at the 5% level). This implies that the new model (BPRS_ref1) provides a better fit than the comparison model (BPRS_ref).
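The chi-squared statistic and its p-value can be reproduced by hand from the deviances reported in the two model summaries (a quick sketch):

```r
# Deviance difference of the two models (from the summaries above);
# df = 2 because the random slope model has two extra parameters
# (the slope variance and the intercept-slope correlation)
chisq <- 2738.7 - 2731.4
pchisq(chisq, df = 2, lower.tail = FALSE)  # about 0.026, as in the ANOVA table
```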
Let us still improve the model with one more detail.
The final fit will be done with a random intercept and slope model that allows for a group × week (time) interaction.
#Creating a random intercept and random slope model with the interaction
BPRS_ref2 <- lmer(bprs ~ week * treatment + (week | subject), data = BPRSL, REML = FALSE)
#Printing a summary of the model
summary(BPRS_ref2)
## Linear mixed model fit by maximum likelihood ['lmerMod']
## Formula: bprs ~ week * treatment + (week | subject)
## Data: BPRSL
##
## AIC BIC logLik deviance df.resid
## 2744.3 2775.4 -1364.1 2728.3 352
##
## Scaled residuals:
## Min 1Q Median 3Q Max
## -3.0513 -0.6271 -0.0768 0.5288 3.9260
##
## Random effects:
## Groups Name Variance Std.Dev. Corr
## subject (Intercept) 65.0049 8.0626
## week 0.9687 0.9842 -0.51
## Residual 96.4702 9.8219
## Number of obs: 360, groups: subject, 20
##
## Fixed effects:
## Estimate Std. Error t value
## (Intercept) 47.8856 2.2522 21.262
## week -2.6283 0.3589 -7.323
## treatment2 -2.2911 1.9090 -1.200
## week:treatment2 0.7158 0.4010 1.785
##
## Correlation of Fixed Effects:
## (Intr) week trtmn2
## week -0.650
## treatment2 -0.424 0.469
## wek:trtmnt2 0.356 -0.559 -0.840
#Performing an ANOVA test on the two models
anova(BPRS_ref2, BPRS_ref1)
## Data: BPRSL
## Models:
## BPRS_ref1: bprs ~ week + treatment + (week | subject)
## BPRS_ref2: bprs ~ week * treatment + (week | subject)
## npar AIC BIC logLik deviance Chisq Df Pr(>Chisq)
## BPRS_ref1 7 2745.4 2772.6 -1365.7 2731.4
## BPRS_ref2 8 2744.3 2775.4 -1364.1 2728.3 3.1712 1 0.07495 .
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
From the likelihood ratio test of the interaction random intercept and slope model against the previous model we may observe that the interaction does not significantly improve the fit: the chi-squared p-value rises from 0.02636 to 0.075, which is above the 5% significance level. (The test now has 1 degree of freedom instead of 2, since only the interaction term was added.) Thus, the random intercept and random slope model without the treatment × week interaction works better in our case.
Let us therefore continue with the previous model (BPRS_ref1).
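The same conclusion can be cross-checked with information criteria; a minimal sketch, assuming the two fitted model objects from above are still in the workspace:

```r
#AIC rewards fit, while BIC penalises extra parameters more heavily;
#here the interaction model wins narrowly on AIC but loses on BIC,
#supporting the choice of the simpler model
AIC(BPRS_ref1, BPRS_ref2)
BIC(BPRS_ref1, BPRS_ref2)
```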
Next, using the previous model (the random intercept and random slope model) we will try to find and plot the fitted BPRS profiles, and then compare with the observed values for both of the treatment groups.
#Drawing the plot of BPRS with the observed bprs values
g_obs <- ggplot(BPRSL, aes(x = week, y = bprs, linetype = subject, col = subject)) +
geom_point(size = 0.5) +
geom_line() +
scale_linetype_manual(values = rep(1:10, times=4)) +
facet_grid(. ~ treatment, labeller = label_both) +
scale_y_continuous(limits = c(min(BPRSL$bprs), max(BPRSL$bprs)))
#Creating a vector of the fitted values from the chosen model (BPRS_ref1)
Fitted <- fitted(BPRS_ref1)
#Adding the fitted values as a new column of BPRSL
BPRSL <- mutate(BPRSL, Fitted = Fitted)
#Drawing the plot of BPRSL with the Fitted values of bprs
g_fit <- ggplot(BPRSL, aes(x = week, y = Fitted, linetype = subject, col = subject)) +
geom_point(size = 0.5) +
geom_line() +
scale_linetype_manual(values = rep(1:10, times=4)) +
facet_grid(. ~ treatment, labeller = label_both) +
scale_y_continuous(limits = c(min(BPRSL$Fitted), max(BPRSL$Fitted)))
g_obs; g_fit
The model seems to fit the observed data quite well. A linear model may not be the best option for these data, but of the models tried above this one works best. From here we can say that both groups show a clear decrease in BPRS values over time, and that there seem to be no notable differences between the two groups: both treatments worked equally well.
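One quick way to quantify the visual agreement between the two plots is the squared correlation between the observed and fitted values; a minimal sketch, assuming `BPRSL` with the `Fitted` column from above:

```r
#Squared correlation between observed bprs and model-fitted values;
#values near 1 would indicate the fitted profiles track the observations closely
cor(BPRSL$bprs, BPRSL$Fitted)^2
```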
Summary of my learning
Phew, that was the last week! This demanded quite a lot of work from me. It was odd to try to apply the methods to the swapped datasets. I faced some challenges, but I think I managed quite well. I feel relieved that I got everything done, and even though this week’s tasks were quite heavy I actually enjoyed them (after I got the hang of it).
Thank you for the course
Enjoy your holidays, you’ve earned it!
#A small function that draws a Christmas tree of asterisks in the console
ctree <- function(N = 10){
#Row i prints N - i + 1 empty strings (cat separates them with spaces,
#creating the indentation) followed by i stars
for (i in 1:N) cat(rep("", N - i + 1), rep("*", i), "\n")
#The trunk: a single star under the tree
cat(rep("", N), "*\n")
}
ctree()
ctree()
## *
## * *
## * * *
## * * * *
## * * * * *
## * * * * * *
## * * * * * * *
## * * * * * * * *
## * * * * * * * * *
## * * * * * * * * * *
## *
Regards, Janina